<p align="center"><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="260" height="110" /></p>

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

---
<a name = Section1></a>
# **1. Introduction**
---
- In this analysis we use a survey data set that captures unique user IDs, each user's age, their tenure on Facebook, the number of likes given, the number of likes received, how the likes received vary between the web app and the mobile device, how the friend count varies with age, and so on.
- Our main intention is to look for patterns that can help Facebook improve its mobile and web features, decide which age groups to target more, and understand how people of different age groups can leverage social media for marketing and for popularizing a personal brand.

---
<a name = Section2></a>
# **2. Problem Statement**
---

<p align="center"><img src="https://chi2016.acm.org/wp/wp-content/uploads/2016/02/Facebook-06-2015-Blue.png"></p>

- Derive a scenario related to the problem statement and head into the journey of exploration.

- **Example Scenario:**
  - Facebook, Inc. is an American social media conglomerate founded in 2004 by Mark Zuckerberg together with fellow Harvard University students. People from different countries, ethnicities, cultures and age groups interact on and use Facebook on a daily basis. The company wants to increase the opportunities for users to run social media campaigns on Facebook, for small and big businesses alike, and to enable users to generate effective leads. To that end, it has collected usage data for different age groups: the device they use, the likes they receive and give, the number of friends they have and the friend requests they have initiated. This will help Facebook personalise the user experience, track patterns that increase engagement, and improve app and website features.
  - To tackle this problem, Facebook has hired a data engineer to find patterns and derive meaningful insights.
  

---
<a id = Section3></a>
# **3. Installing & Importing Libraries**
---

- This section focuses on installing and importing the libraries required for the analysis.

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
!pip install -q datascience                                         
!pip install -q pandas-profiling                           

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Make sure not to execute the cells under Installing Libraries and Upgrading Libraries again after restarting the runtime.

In [None]:
!pip install -q --upgrade pandas-profiling

<a name = Section33></a>
### **3.3 Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.


In [None]:
import numpy as np
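np.set_printoptions(precision=4)                    # To display values only up to four decimal places.

import pandas as pd
pd.set_option('mode.chained_assignment', None)      # To suppress pandas chained-assignment warnings.
pd.set_option('display.max_colwidth', None)         # To display all the data in each column (-1 is deprecated since pandas 1.0).
pd.options.display.max_columns = None               # To display every column of the dataset in head()

import matplotlib.pyplot as plt
# The seaborn styles were renamed to 'seaborn-v0_8-*' in matplotlib 3.6, so pick whichever name is available.
whitegrid_style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(whitegrid_style)                      # To apply seaborn whitegrid style to the plots.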
%matplotlib inline

import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)      # To adjust seaborn settings for the plots.

import warnings
warnings.filterwarnings('ignore')                   # To suppress all the warnings in the notebook.

In [None]:
from pandas.plotting import parallel_coordinates

In [None]:
!pip install plotly --upgrade

In [None]:
!pip install chart-studio

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} plotly
#pip install plotly
import plotly.graph_objs as go

In [None]:
import pandas_profiling

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- This section focuses on acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or read it from a direct link (generally preferred in most cases).

- Here you will work with a direct link to the data set, so that you can start right away.

- Before going further you must have a good idea about the features of the data set:

|Id|Feature|Description|
|:--|:--|:--|
|01| userid                 | A numeric value uniquely identifying the user.|
|02| age                    | Age of the user in years.|
|03| dob_day                | Day part of the user's date of birth.|
|04| dob_year               | Year part of the user's date of birth.| 
|05| dob_month              | Month part of the user's date of birth.|
|06| gender                 | Gender of the user.| 
|07| tenure                 | Number of days since the user has been on FB.|
|08| friend_count           | Number of friends the user has.|
|09| friendships_initiated  | Number of friendships initiated by the user.|
|10| likes                  | Total number of posts liked by the user.|
|11| likes_received         | Total number of likes received by the user's posts.|
|12| mobile_likes           | Number of posts liked by the user through the mobile app.|
|13| mobile_likes_received  | Number of likes received by the user through the mobile app.|
|14| www_likes              | Number of posts liked by the user through the web.|
|15| www_likes_received     | Number of likes received by the user through the web.|


In [None]:
fb = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Projects/facebook_data.csv")
fb.head()
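Before exploring further, it can help to confirm that the loaded frame really contains the fifteen features listed in the table. The following is a minimal sketch and not part of the original notebook: `missing_columns` is a hypothetical helper, and the small `toy` frame stands in for `fb` so that the example runs without a network connection.

```python
import pandas as pd

# Column names copied from the feature table above.
EXPECTED_COLUMNS = [
    'userid', 'age', 'dob_day', 'dob_year', 'dob_month', 'gender', 'tenure',
    'friend_count', 'friendships_initiated', 'likes', 'likes_received',
    'mobile_likes', 'mobile_likes_received', 'www_likes', 'www_likes_received',
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Made-up frame standing in for the real data; it deliberately lacks most columns.
toy = pd.DataFrame({'userid': [1, 2], 'age': [21, 35], 'gender': ['male', 'female']})
print(missing_columns(toy))
```

If the CSV matches the table, calling `missing_columns(fb)` on the real data should return an empty list.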

### **Data Description**

- To get a quick statistical description of the data, you can use the `describe` method of a pandas DataFrame.

In [None]:
fb.describe()

### **Data Information**

In [None]:
fb.info()

---
<a name = Section5></a>
# **5. Data Pre-Profiling**
---

- This section focuses on generating a profiling report of the data.

- You need to run pandas profiling and draw some observations from the report.

In [None]:
profile = fb.profile_report(title="Data Visualization Report for Facebook Data", progress_bar=False, minimal=False)
profile.to_file(output_file="exploratory_analysis.html")
profile

In [None]:
## Observations
# The minimum age is 13 and the maximum age is 113.
# The number of males in the data set is higher than the number of females.
# The mean tenure is around 537 days.
# The number of friends varies drastically, with a maximum of 4923 friends.
# The number of friendships initiated has a mean value of 107, but its distribution is again heavily skewed.
# Mobile likes received seem to be significantly higher than www likes received, indicating more app usage.
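Skew of the kind noted in these observations can be quantified by comparing the mean with the median and by computing the sample skewness. The sketch below is illustrative only: `skew_summary` is a hypothetical helper, and the friend counts are made up (a few small values plus one very large one), not taken from the data set.

```python
import pandas as pd

def skew_summary(series):
    """Return mean, median and sample skewness for one numeric column."""
    return {'mean': series.mean(), 'median': series.median(), 'skew': series.skew()}

# Made-up friend counts: mostly small values plus one very large one.
toy_friend_count = pd.Series([5, 10, 12, 20, 30, 45, 4923])
summary = skew_summary(toy_friend_count)
print(summary)
```

A mean far above the median together with a positive skewness points to a long right tail, which is why the median is often the safer summary for columns such as `friend_count`.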

---
<a name = Section6></a>
# **6. Data Pre-Processing**
---

- This section focuses on manipulating unstructured data so that it is ready for further processing and analysis.

- To turn unstructured data into structured data, you need to verify and restore the integrity of the data by:
  - Handling missing data,
  - Handling redundant data,
  - Handling inconsistent data,
  - Handling outliers,
  - Handling typos
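Three of the steps listed above (missing data, redundant data and outliers) can be sketched on a small made-up frame. This is an illustration under stated assumptions, not the notebook's own pipeline: the records are invented, and the 1.5 × IQR rule is one common convention for flagging outliers rather than a method the source prescribes.

```python
import pandas as pd

# Made-up records with one missing gender, one duplicated row and one extreme value.
toy = pd.DataFrame({
    'userid': [1, 2, 2, 3, 4, 5],
    'gender': ['male', 'female', 'female', None, 'male', 'female'],
    'friend_count': [10, 12, 12, 15, 11, 900],
})

# Handling missing data: count the gaps, then fill them with an explicit label.
missing_before = int(toy['gender'].isna().sum())
toy['gender'] = toy['gender'].fillna('Not sure')

# Handling redundant data: drop rows that repeat an earlier row exactly.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Handling outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = deduped['friend_count'].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped['friend_count'] < q1 - 1.5 * iqr) | (deduped['friend_count'] > q3 + 1.5 * iqr)

print(missing_before, len(deduped), deduped.loc[is_outlier, 'userid'].tolist())
```

Flagging outliers rather than deleting them keeps the decision reversible, which matters here because very popular accounts may be genuine users rather than data errors.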

In [None]:
fb.isna().sum()

In [None]:
fb['gender'] = fb['gender'].fillna('Not sure')

In [None]:
fb['tenure'] = fb['tenure'].fillna(0)

In [None]:
fb.isna().sum()
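Filling missing `tenure` values with 0 pulls the column mean downwards, because 0 is treated as a real observation. The sketch below compares that choice with median imputation on made-up values; it is meant as an illustration of the trade-off, not as a recommendation for this data set.

```python
import pandas as pd

# Made-up tenure values with one gap.
toy_tenure = pd.Series([100.0, 200.0, None, 300.0, 400.0])

filled_zero = toy_tenure.fillna(0)                      # treats the gap as "0 days"
filled_median = toy_tenure.fillna(toy_tenure.median())  # median() skips the gap by default

print(filled_zero.mean(), filled_median.mean())
```

When only a handful of values are missing, as is typical for a column like this, the difference is small, but it is worth checking before relying on the mean.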

In [None]:
fb.sample(20)

In [None]:
fb.info()

In [None]:
fb['tenure']= fb['tenure'].astype(int)

In [None]:
fb.info()

---
<a name = Section7></a>
# **7. Data Post-Profiling**
---

- This section focuses on generating a profiling report of the data after the data manipulation.

- You may observe some changes compared with the first report, so check them and record the right observations.

In [None]:
profile = fb.profile_report(title="Data Visualization Report for Facebook Data", progress_bar=False, minimal=False)
profile.to_file(output_file="exploratory_analysis.html")
profile

---
<a name = Section8></a>
# **8. Exploratory Data Analysis**
---

- This section focuses on asking the right questions and answering them through analysis of the data.

- Note that there is no limit to how deep you can go, but make sure not to drift away from the problem statement.

In [None]:
## Highest and lowest ages in the data set.
fb['age'].max()

In [None]:
fb['age'].min()

In [None]:
## Show the distribution of age

fb['age'].plot(kind = 'kde')

In [None]:
## Total Number of Females, Males and those in 'Not sure' category in Each age group
fb.groupby(['age'])['gender'].value_counts()

In [None]:
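# right=False makes each bin include its left edge, so [19, 25) holds ages 19-24 and matches its label.
# np.inf as the last edge keeps ages above 99 (the maximum age is 113) in the '45+' group instead of turning them into NaN.
fb['age_group'] = pd.cut(fb['age'], [0, 19, 25, 35, 45, np.inf], labels=['18 and under', '19-24', '25-34', '35-44', '45+'], right=False)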
fb['age_group'].value_counts()
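How `pd.cut` treats values that sit exactly on a bin edge is easiest to see on a few boundary ages. The sketch below uses made-up ages and illustrative bin edges and labels (assumptions for this example, not necessarily those of the cell above) and compares `right=False` with the default `right=True`.

```python
import numpy as np
import pandas as pd

toy_ages = pd.Series([18, 19, 24, 25, 45, 113])   # made-up boundary ages
edges = [0, 19, 25, 35, 45, np.inf]               # illustrative bin edges
labels = ['18 and under', '19-24', '25-34', '35-44', '45+']

# right=False: each bin includes its left edge, e.g. [19, 25) holds ages 19 to 24.
left_closed = pd.cut(toy_ages, edges, labels=labels, right=False)

# Default (right=True): each bin includes its right edge, e.g. (19, 25] holds ages 20 to 25.
right_closed = pd.cut(toy_ages, edges, labels=labels)

print(pd.DataFrame({'age': toy_ages, 'right=False': left_closed, 'right=True': right_closed}))
```

With these edges, only `right=False` places ages 19 and 45 under the labels that describe them, and the open-ended last edge keeps very high ages such as 113 from falling outside every bin.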

In [None]:
# No fixed yticks: matplotlib picks ticks that match the scale of the counts.
fb['age_group'].value_counts().plot(kind='bar',figsize=(10,5),colormap='rainbow',fontsize=13)
plt.title('Breakup of different age groups')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.savefig('breakdown-category.png')

In [None]:
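## Number of users in the 'Not sure' category in each age group.
# The comparison must be applied to the column (fb['gender'] == ...), not to the column name.
fb[fb['gender'] == 'Not sure']['age_group'].value_counts()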

In [None]:
## Show the spread of age group for male, female and not sure categories.
plt.figure(figsize=(15,8))
sns.boxplot(data=fb, x='gender', y='age', palette='rainbow')

In [None]:
## Age Groups and their Tenure

fb.groupby(['age_group'])['tenure'].mean().sort_values(ascending = False)

In [None]:
## Find the most popular males 

fb_malepopular=fb[fb['gender'] == 'male'][['age','likes_received']].sort_values('likes_received',ascending=False)[:10]
fb_malepopular.plot(kind='scatter', x='age', y='likes_received', figsize=(15, 7), color='blue', grid=False)

In [None]:
## Find the most popular females

fb_femalepopular=fb[fb['gender'] == 'female'][['age','likes_received']].sort_values('likes_received',ascending=False)[:10]
fb_femalepopular.plot(kind='scatter', x='age', y='likes_received', figsize=(15, 7), color='blue', grid=False)

In [None]:
## Top 10 ages by mean tenure.

fb.groupby(['age'])['tenure'].mean().sort_values(ascending = False)[:10]

In [None]:
### Is there a relationship between number of friends and number of likes received?
plt.figure(figsize=(15,8))
sns.scatterplot(data=fb, x='friend_count', y='likes_received', hue='gender')

## There is no strong relationship between friend count and likes received; at most there is a slight upward trend as the friend count increases. A few outliers exceed 250000 likes at a comparatively low friend count. These are probably social media influencers, bloggers, or other celebrity figures.
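A scatter plot with a few extreme points can make a trend hard to judge by eye, so it may help to put a number on it. The sketch below uses made-up values and computes both the Pearson correlation and a rank-based (Spearman) correlation; the latter is obtained by correlating the ranks, which is the definition of Spearman's coefficient and avoids any extra dependency.

```python
import pandas as pd

# Made-up values: likes mostly grow with friend count, with one extreme account.
toy = pd.DataFrame({
    'friend_count': [10, 20, 30, 40, 50],
    'likes_received': [1, 3, 2, 250000, 5],
})

# Pearson measures linear association and is sensitive to the extreme value.
pearson = toy['friend_count'].corr(toy['likes_received'])

# Spearman is the Pearson correlation of the ranks, so outliers count only by position.
spearman = toy['friend_count'].rank().corr(toy['likes_received'].rank())

print(round(pearson, 3), round(spearman, 3))
```

On the real data, a clearly positive Spearman value alongside a weak Pearson value would be consistent with a slight upward trend that is obscured by a few very popular accounts.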

In [None]:
### Is there a relationship between friendships initiated and number of likes received?

plt.figure(figsize=(15,8))
sns.regplot(data=fb,x='friendships_initiated',y='likes_received',color='green')

### For fewer than 1000 friendships initiated, the number of likes received is high. However, it keeps decreasing as the number of friendships initiated becomes very high.


In [None]:
### Is there a relationship between mobile likes given and age?

plt.figure(figsize=(15,8))
sns.scatterplot(data=fb, x='age', y='mobile_likes', hue='gender')

#### People aged between 20 and 30 give the most likes through mobile access.

In [None]:
### Is there a relationship between www likes given and age?

plt.figure(figsize=(15,8))
sns.scatterplot(data=fb, x='age', y='www_likes', hue='gender')

#### People aged between 18 and 24 give the most likes through www access, and females appear to be the heavier users.

In [None]:
### Is there a relationship between age and likes received?

plt.figure(figsize=(15,8))
# Plot likes_received (not likes, which counts likes given) to match the question above.
sns.scatterplot(data=fb, x='age', y='likes_received', hue='gender')

### Younger females and males seem to receive a higher number of likes than the rest.

In [None]:
### Is there a relationship between likes received and gender?
fb.groupby(['gender'])['likes_received'].mean().plot(kind='bar', figsize=(15,7), color='red')
### Females receive more likes on average, i.e. they have a higher popularity on Facebook.
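Because a few accounts receive a very large number of likes, a group mean can be dominated by them. The sketch below, on made-up values, shows how reporting the median next to the mean can change which group looks more popular; it illustrates the check and says nothing about the real data.

```python
import pandas as pd

# Made-up values: one male account with a very high number of likes received.
toy = pd.DataFrame({
    'gender': ['male', 'male', 'male', 'female', 'female', 'female'],
    'likes_received': [1, 2, 300, 10, 20, 30],
})

# Mean and median per gender, side by side.
by_gender = toy.groupby('gender')['likes_received'].agg(['mean', 'median'])
print(by_gender)
```

In this toy example the male mean is the larger one while the male median is the smaller one, so running the same comparison on `fb` would show whether the gender gap in the bar chart reflects typical users or a few outliers.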

In [None]:
## Examine the relationship between multiple continuous variables: mobile_likes, www_likes, mobile_likes_received and www_likes_received.

# pairplot creates its own figure, so no plt.figure() call is needed.
sns.pairplot(fb[['mobile_likes','www_likes','mobile_likes_received','www_likes_received','gender']],hue='gender',diag_kind='kde')

In [None]:
# pairplot creates its own figure, so no plt.figure() call is needed.
sns.pairplot(fb[['mobile_likes','www_likes','age_group']],hue='age_group')

### There is a negative relationship between www likes and mobile likes given. This suggests that most individuals access Facebook using only one device.
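The single-device reading of the plot can also be checked directly by counting users who give likes through exactly one channel. A minimal sketch on made-up values follows; treating "gave at least one like through a channel" as "uses that channel" is an assumption of this example.

```python
import pandas as pd

# Made-up likes given per user through each channel.
toy = pd.DataFrame({'mobile_likes': [10, 0, 5, 0, 7], 'www_likes': [0, 8, 0, 0, 3]})

uses_mobile = toy['mobile_likes'] > 0
uses_web = toy['www_likes'] > 0

active = uses_mobile | uses_web            # gave at least one like on any channel
single_channel = uses_mobile ^ uses_web    # gave likes on exactly one channel

share_single = single_channel.sum() / active.sum()
print(share_single)
```

A share close to 1 on the real data would support the interpretation that most active users stick to one device.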

In [None]:
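# Show the correlation between the different numeric variables:
corr = fb.select_dtypes('number').corr()    # corr() needs numeric columns, so gender and age_group are excluded
plt.figure(figsize=(20,10))
sns.heatmap(corr,cmap='plasma',annot=True)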

In [None]:
# Analysis:
# There is a high correlation between friend count and friendships initiated.
# There is a high correlation between mobile likes given and total number of likes given, and a low correlation between www likes given and total likes given.
# There is a high correlation between mobile likes received and total number of likes received, which is slightly greater than the correlation between www likes received and total likes received.
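Reading the strongest relationships off a large annotated heatmap is error-prone, so the pairs can also be ranked programmatically. The sketch below is an assumption-laden illustration: `top_correlations` is a hypothetical helper, and the frame is made up.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes('number').corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up frame: b is an exact multiple of a, c moves roughly against a.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
    'label': ['x', 'y', 'x', 'y', 'x'],   # non-numeric column, ignored by the helper
})
top = top_correlations(toy)
print(top)
```

Calling `top_correlations(fb, n=5)` on the real data would list the pairs that the analysis above describes, with the exact coefficients next to them.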

---
<a name = Section9></a>
# **9. Summarization**
---

- We first used pandas profiling to check the different characteristics of the data set.
- After the initial observations, we moved on to data cleaning. Here we replaced missing (NaN) values with appropriate values.
- We then performed post-profiling to observe any changes.
- Next, we conducted EDA. Here we added one more column called age_group.
- We then based our observations on the age_group division, the gender division, likes received and given, the popularity of females and males, their usage patterns, the heaviest users and so on.
- We found that females are the most popular and that people in the age group 20-24 years have the highest usage.
- There are some outliers, and these are mostly males who have very high popularity.
- There is no strong relationship between friend count and likes received; at most there is a slight upward trend as the friend count increases. A few outliers exceed 250000 likes at a comparatively low friend count. These are probably social media influencers, bloggers, or other celebrity figures.

<a name = Section91></a>
### **9.1 Conclusion**

- Through EDA and the associated visualisations, we realised that the age group that uses Facebook most actively is between 20 and 30 years of age, and that these users use the mobile app more frequently. Web usage, however, is spread over a wider age group, which suggests that the social media features on the website are more user friendly, easier to use and more interactive. Further analysis could identify the different pages these users visit, the businesses they have set up and are active on, and how social media campaigns have gathered pace over the last few years. This would help Facebook grow its user base and achieve its target of increasing the commercial usage of Facebook.


<a name = Section92></a>
### **9.2 Actionable Insights**


- The highest and lowest ages in the dataset are 113 and 13.
- The total number of females, males and users in the 'Not sure' category was computed for each age group.
- The highest number of users is in the 45+ age group, and the lowest is in the 35-44 age group.
- The ages of female users are more spread out than those of male users, and their central tendency is also higher.
- Males in the age group of 18-24 are the most popular, with a substantially high number of likes received, while for females the spread is between 11 and 20 years.
- For fewer than 1000 friendships initiated, the number of likes received is high. However, it keeps decreasing as the number of friendships initiated becomes very high.
- People aged between 20 and 30 give the most likes through mobile access.
- People aged between 18 and 24 give the most likes through www access, and females appear to be the heavier users.
- Younger females and males seem to receive a higher number of likes than the rest, implying that age does have an impact on popularity.
- There is a negative relationship between www likes and mobile likes given. This suggests that most individuals access Facebook using only one device.
- There is a high correlation between friend count and friendships initiated.
- There is a high correlation between mobile likes given and total number of likes given, and a low correlation between www likes given and total likes given.
- There is a high correlation between mobile likes received and total number of likes received, which is slightly greater than the correlation between www likes received and total likes received.
- Data on the different pages visited by users, along with the frequency and time of day of those visits, could have helped us in our goal of increasing Facebook usability to improve social media marketing.

### **9.3 Suggestions**

- The spread of data for females is between 20 and 55 years of age, and for males it is between 20 and 45 years of age. Hence, this is the range on which we should focus our efforts.
- Younger males seem to be more engaged, which implies a higher social media presence; further analysis can identify their behavioural patterns and the content they engage with.
- Organic reach has dropped on almost every social media platform in recent years. However, accounts with higher social media engagement are the least affected. In fact, Facebook uses “meaningful engagement” as an important signal that a post should be prioritized. Hence, people with a higher number of posts or likes given should be prioritized more.
- Popular males and females in a certain age group (approximately 18-22) have the highest number of likes received. Hence, their social media presence can be analysed more closely, including the time of day they are active, the pages they have created and their personalities; social media analytics can play a vital role in this.
- Mobile access to Facebook is high for people in the age group of 18 to 30. A high number of likes given through the web, however, is spread up to the age of 65-70 years. This suggests that web access is easier or more comfortable than the mobile app for a larger population, and that work is needed to make the mobile app easier to use, with interactive features.
- In order to encourage users to interact with Facebook more for social media marketing, it is important to gather further data on small and large business pages on Facebook and on how different individuals are leveraging Facebook for them.