# LinkedIn Network Analysis
Author: [Richard Cornelius Suwandi](https://github.com/richardcsuwandi)

As an active user on [LinkedIn](https://www.linkedin.com/in/richardcsuwandi/) with more than 1000 connections, I was curious about the statistics of my network. In this project, I used exploratory analysis and data visualization to gain insights from my own LinkedIn data.

## Data Preparation
First, let's import the necessary libraries for this project:

In [1]:
# Import the libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

Next, we can load the data, which I have already downloaded as a `.csv` file. To download your own data, follow the instructions [here](https://www.linkedin.com/help/linkedin/answer/50191/downloading-your-account-data?lang=en).

Note: For privacy reasons, the data shown in this project may differ slightly from the original data.

In [4]:
# Load the data
df = pd.read_csv("Connections.csv")
df.head(10)

|  | First Name | Last Name | Company | Position | Connected On |
|---|---|---|---|---|---|
| 0 | Anastasia | Gorina | Prime Clerk | Data Analyst | 22 Feb 2021 |
| 1 | Michael | Duncan | DispatchHealth | Jr. Machine Learning Engineer | 22 Feb 2021 |
| 2 | Cansu | CANDAN | KTU Artificial Intelligence Society | Member Of The Management Board | 22 Feb 2021 |
| 3 | Richard | Pisano | Springboard | Data Analyst | 22 Feb 2021 |
| 4 | Arbak | Aydemir | Apple | AI/ML - Annotation Analyst | 22 Feb 2021 |
| 5 | Antonio | Boza León | Waku casa de Software, C.A. | Científico de datos | 22 Feb 2021 |
| 6 | Abir Akhalak | Khan | Genesys International Corporation Ltd | Artificial Intelligence Engineer | 22 Feb 2021 |
| 7 | Aman | Malik |  |  | 22 Feb 2021 |
| 8 | Ömer | Peköz | SITEMARK | Data Processing Engineer | 22 Feb 2021 |
| 9 | Victor | Basu | Lumiq | Data Scientist | 22 Feb 2021 |


The DataFrame above displays only my 10 latest connections on LinkedIn. The `Connected On` column indicates the date on which I connected with that person.
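
The `pd.read_csv` call above assumes that the header row is the first line of the file. Some exports may instead begin with a few lines of notes before the header; whether yours does is an assumption worth checking by opening the file in a text editor. Below is a minimal sketch of a loader that tolerates such a preamble. The helper name `read_connections`, its `header_marker` parameter, and the sample rows are all made up for illustration.

```python
import io

import pandas as pd


def read_connections(text, header_marker="First Name"):
    """Parse CSV text, skipping any lines that come before the header row."""
    lines = text.splitlines()
    # Index of the first line that mentions the expected header column
    start = next(
        (i for i, line in enumerate(lines) if header_marker in line),
        None,
    )
    if start is None:
        raise ValueError(f"No header line containing {header_marker!r} found")
    return pd.read_csv(io.StringIO("\n".join(lines[start:])))


# Hypothetical file content with a preamble (not real LinkedIn data)
sample_text = (
    "Notes:\n"
    "Hypothetical preamble line that is not part of the table\n"
    "\n"
    "First Name,Last Name,Company,Position,Connected On\n"
    "Ada,Example,Example Corp,Data Scientist,22 Feb 2021\n"
    "Alan,Sample,,,21 Feb 2021\n"
)
sample_df = read_connections(sample_text)
print(sample_df)
```

For a real file, the text could be obtained with `open("Connections.csv", encoding="utf-8").read()` before calling the helper.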

In [5]:
# Describe the data
df.describe()

|  | First Name | Last Name | Company | Position | Connected On |
|---|---|---|---|---|---|
| count | 1029 | 1029 | 952 | 952 | 1034 |
| unique | 908 | 902 | 800 | 674 | 223 |
| top | Andrew | Kumar | The Chinese University of Hong Kong, Shenzhen ... | Data Scientist | 26 Aug 2020 |
| freq | 7 | 9 | 30 | 76 | 67 |
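
The `count` row differs between columns because `describe()` counts only non-missing values: 952 of the 1034 rows have a `Company`, so 82 are blank. One way to see this directly is to count the missing values per column. The sketch below uses a tiny made-up DataFrame (`sample`) instead of the real export.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing value
sample = pd.DataFrame({
    "First Name": ["Ada", "Alan", None, "Grace"],
    "Company": ["Example Corp", None, None, "Example Corp"],
    "Connected On": ["22 Feb 2021", "22 Feb 2021", "21 Feb 2021", "20 Feb 2021"],
})

# Number of missing values in each column
missing = sample.isna().sum()
# Number of non-missing values in each column (what describe() reports as count)
non_missing = sample.count()
print(missing)
```

On the real data, the same idea would be `df.isna().sum()`.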


## Date Connected

Let's take a closer look at the `Connected On` column. Before that, we need to convert the column to a datetime format.

In [6]:
# Convert the 'Connected On' column to datetime format
df["Connected On"] = pd.to_datetime(df["Connected On"])
df["Connected On"]

0      2021-02-22
1      2021-02-22
2      2021-02-22
3      2021-02-22
4      2021-02-22
          ...    
1029   2020-05-21
1030   2020-05-21
1031   2020-05-21
1032   2020-05-21
1033   2020-05-16
Name: Connected On, Length: 1034, dtype: datetime64[ns]
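
Without a `format` argument, pandas infers the date layout from the data. Since the values here look like `22 Feb 2021`, the layout can also be stated explicitly as `%d %b %Y`. The sketch below does that on a few made-up values and uses `errors="coerce"` so that entries that cannot be parsed become `NaT` instead of raising an error. Note that `%b` matches abbreviated month names, which are assumed to be in English here.

```python
import pandas as pd

# Made-up values; the last one is deliberately invalid
raw = pd.Series(["22 Feb 2021", "26 Aug 2020", "not a date"])

# Parse with an explicit day / abbreviated month / year layout
parsed = pd.to_datetime(raw, format="%d %b %Y", errors="coerce")
print(parsed)

# Count how many entries could not be parsed
print(parsed.isna().sum())
```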

Now, we can visualize the number of connections on a given date using Plotly's line plot.

In [7]:
# Create a line plot to visualize the number of connections on a given date
fig1 = px.line(df.groupby(by="Connected On").count().reset_index(), 
               x="Connected On", 
               y="First Name", 
               labels={"First Name": "Count"},
               title="Number of Connections on a Given Date")
fig1.show()

From the line plot above, we can see a peak in the number of connections per day on 26 August 2020, with 67 new connections (the `top` and `freq` values in the summary above). August 2020 also appears to be the period when I was most active on LinkedIn.
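
The peak day and the busiest month can also be read off numerically rather than from the plot. The following is a minimal sketch on made-up dates; the variable names (`dates`, `daily`, `monthly`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up connection dates for illustration
dates = pd.to_datetime(pd.Series([
    "2020-08-26", "2020-08-26", "2020-08-26",
    "2020-08-27", "2020-09-01", "2021-02-22",
]))

# Connections per day, and the day with the most connections
daily = dates.value_counts()
peak_day = daily.idxmax()

# Connections per calendar month, and the busiest month
monthly = dates.dt.to_period("M").value_counts()
busiest_month = monthly.idxmax()

print(peak_day, daily.max())
print(busiest_month, monthly.max())
```

On the real data, `dates` would be `df["Connected On"]` after the datetime conversion.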

## Company
> Which companies/organizations do the people in my network mainly come from?

To answer that question, we first need to group and sort the data by company:

In [8]:
# Group and sort the data by company 
df_by_company = df.groupby(by="Company").count().reset_index().sort_values(by="First Name", ascending=False).reset_index(drop=True)
df_by_company

|  | Company | First Name | Last Name | Position | Connected On |
|---|---|---|---|---|---|
| 0 | The Chinese University of Hong Kong, Shenzhen ... | 30 | 30 | 30 | 30 |
| 1 | The Sparks Foundation | 12 | 12 | 12 | 12 |
| 2 | Perhimpunan Pelajar Indonesia (PPI) Tiongkok | 12 | 12 | 12 | 12 |
| 3 | Towards Data Science | 12 | 12 | 12 | 12 |
| 4 | Amazon | 7 | 7 | 7 | 7 |
| ... | ... | ... | ... | ... | ... |
| 795 | Hangzhou Indonesian Student Community | 1 | 1 | 1 | 1 |
| 796 | Happy Chinese World | 1 | 1 | 1 | 1 |
| 797 | Heap | 1 | 1 | 1 | 1 |
| 798 | Helium 10 | 1 | 1 | 1 | 1 |
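
Note that `groupby().count()` counts the non-missing values in each remaining column, so using `First Name` as the count silently skips rows where the first name is blank. If the goal is the number of rows per company, `groupby().size()` is one alternative. The sketch below shows the difference on made-up data; the column name `Count` is chosen here for illustration.

```python
import pandas as pd

# Made-up rows; one Acme row has no first name, one row has no company
sample = pd.DataFrame({
    "First Name": ["Ada", "Alan", None, "Grace", "Linus"],
    "Company": ["Acme", "Acme", "Acme", "Globex", None],
})

# Counts non-missing first names per company (the missing one is skipped)
by_first_name = sample.groupby("Company")["First Name"].count()

# Counts rows per company, regardless of missing values in other columns
by_rows = (
    sample.groupby("Company")
    .size()
    .sort_values(ascending=False)
    .reset_index(name="Count")
)

print(by_first_name)
print(by_rows)
```

In both cases, rows without a `Company` are dropped, because `groupby` excludes missing keys by default.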


Now that we have our data grouped and sorted by company, we can visualize it using Plotly's bar plot:

In [9]:
# Create a bar plot for the top companies
fig2 = px.bar(df_by_company[:20],
              x="Company",
              y="First Name",
              labels={"First Name": "Count"},
              title="Top Companies/Organizations in my Network")
fig2.show()

It worked just fine, but perhaps Plotly's [treemap](https://plotly.com/python/treemaps/) will do a better job of visualizing the companies in this case.

In [8]:
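# Create a treemap for the top companies
# Note: after groupby().count(), "Position" holds counts rather than job titles,
# so only "Company" is used as the treemap path
fig3 = px.treemap(df_by_company[:100], path=["Company"],
                 values="First Name",
                 labels={"First Name": "Count"})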
fig3.show()

Using the treemap above, it is easier to compare the proportion of one company/organization to the others. It looks like the largest proportion of my network is from my university.
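
In `df_by_company`, the `Position` column holds counts rather than job titles, because `groupby().count()` replaces every non-key column with a count. A treemap that nests positions inside companies would therefore need one row per (company, position) pair. Below is a minimal sketch of that preparation step on made-up data; `pair_counts` and the `Count` column are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "Company": ["Acme", "Acme", "Acme", "Globex"],
    "Position": ["Data Scientist", "Data Scientist", "Data Analyst", "Founder"],
})

# One row per (Company, Position) pair with the number of people in it
pair_counts = (
    sample.groupby(["Company", "Position"])
    .size()
    .reset_index(name="Count")
)
print(pair_counts)
```

The resulting frame could then be passed to `px.treemap` with `path=["Company", "Position"]` and `values="Count"`.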

## Position
> What are the most common positions of people in my network?

To answer that question, we can create similar visualizations for the `Position` column:

In [10]:
# Group and sort the data by position 
df_by_position = df.groupby(by="Position").count().reset_index().sort_values(by="First Name", ascending=False).reset_index(drop=True)
df_by_position

|  | Position | First Name | Last Name | Company | Connected On |
|---|---|---|---|---|---|
| 0 | Data Scientist | 76 | 76 | 76 | 76 |
| 1 | Machine Learning Engineer | 18 | 18 | 18 | 18 |
| 2 | Data Analyst | 17 | 17 | 17 | 17 |
| 3 | Founder | 15 | 15 | 15 | 15 |
| 4 | Data Science Intern | 12 | 12 | 12 | 12 |
| ... | ... | ... | ... | ... | ... |
| 669 | Economic Analyst | 1 | 1 | 1 | 1 |
| 670 | Editor | 1 | 1 | 1 | 1 |
| 671 | Editor In Chief | 1 | 1 | 1 | 1 |
| 672 | Editor and Writer | 1 | 1 | 1 | 1 |


In [11]:
# Create a bar plot for the top positions
fig4 = px.bar(df_by_position[:20],
              x="Position",
              y="First Name",
              labels={"First Name": "Count"},
              title="Top Positions in my Network")
fig4.show()

In [11]:
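# Create a treemap for the top positions
# Note: after groupby().count(), "Company" holds counts rather than company names,
# so only "Position" is used as the treemap path
fig5 = px.treemap(df_by_position[:100], path=["Position"],
                 values="First Name",
                 labels={"First Name": "Count"})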
fig5.show()

The most common position in my network is data scientist, followed by machine learning engineer and data analyst. It is great to know that the most common positions in my network belong to my target group for networking.

In [12]:
# Count all positions that contain 'Data Scientist'
df["Position"].str.contains("Data Scientist").sum()

129

Wow, I didn't expect to see that many data scientists in my network! 
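
The count of 129 is higher than the 76 exact matches in the table above because `str.contains` also matches titles that merely include the phrase, such as a senior or lead variant. The sketch below compares exact, substring, and case-insensitive matching on a few made-up titles. Passing `na=False` treats missing positions as non-matches.

```python
import pandas as pd

# Made-up job titles for illustration; None marks a missing position
positions = pd.Series([
    "Data Scientist",
    "Senior Data Scientist",
    "data scientist intern",
    None,
    "Data Analyst",
])

# Titles that are exactly "Data Scientist"
exact = (positions == "Data Scientist").sum()

# Titles that contain the phrase (case-sensitive)
substring = positions.str.contains("Data Scientist", na=False).sum()

# Titles that contain the phrase, ignoring case
any_case = positions.str.contains("data scientist", case=False, na=False).sum()

print(exact, substring, any_case)
```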

## Takeaways
It is always fun and interesting to analyze your own data, as you might be surprised by what you see and learn something helpful. Personally, these treemaps made me realize that my LinkedIn network is much more diverse than I had thought.

Let's connect on [LinkedIn](https://www.linkedin.com/in/richardcsuwandi/)!