# Objective
***
We have data from a market-research study conducted by the consumer brand "NutrientH20". Using it, we want to identify any interesting market segments that stand out in the company's social-media audience, so that the brand can sharpen its messaging.

# Approach
***
The input file contains the number of posts by each user that fell into each of 36 pre-specified categories, each representing a broad area of interest (e.g. politics, sports, family).
***
Using this data, our approach is as follows:
- First, for each user, we define the **"Major Category"**: the category from the pre-specified scheme that has the *highest number of posts for that user.*
- Next, we apply dimensionality reduction, using PCA (Principal Component Analysis) followed by t-SNE, to visualize the data. *(PCA is applied first to improve the performance of t-SNE.)*
- After applying t-SNE, we study the visualized data to extract meaningful insights. In particular, we look for ***any relationship between users belonging to different Major Categories***.
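
As a quick illustration of the first step, the sketch below computes a "Major Category" on a tiny made-up table. The user names, the three categories and all counts are hypothetical and are not taken from the study data; the point is only to show how a row-wise `idxmax` picks the dominant column, and how it behaves on a tie.

```python
import pandas as pd

# Toy post counts for three hypothetical users (values invented for illustration).
toy_counts = pd.DataFrame(
    {
        "chatter": [2, 0, 5],
        "politics": [7, 1, 5],
        "cooking": [1, 9, 0],
    },
    index=["user_a", "user_b", "user_c"],
)

# idxmax(axis=1) returns, for each row, the label of the column with the largest value.
# On a tie (user_c has 5 posts in both chatter and politics) the first such column wins.
major_category = toy_counts.idxmax(axis=1)
print(major_category)
```

Note that ties are resolved by column order, so a user with equal counts in two categories is assigned to whichever column appears first in the file.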

# Implementation

In [1]:
#import necessary packages/modules
import pandas as pd
import numpy as np

#read the input file
social_marketing=pd.read_csv('social_marketing.csv')
social_marketing.head()

Unnamed: 0.1,Unnamed: 0,chatter,current_events,travel,photo_sharing,uncategorized,tv_film,sports_fandom,politics,food,...,religion,beauty,parenting,dating,school,personal_fitness,fashion,small_business,spam,adult
0,hmjoe4g3k,2,0,2,2,2,1,1,0,4,...,1,0,1,1,0,11,0,0,0,0
1,clk1m5w8s,3,3,2,1,1,1,4,1,2,...,0,0,0,1,4,0,0,0,0,0
2,jcsovtak3,6,3,4,3,1,5,0,2,1,...,0,1,0,1,0,0,1,0,0,0
3,3oeb4hiln,1,5,2,2,0,1,0,1,0,...,0,1,0,0,0,0,0,0,0,0
4,fd75x1vgk,5,2,0,6,1,0,0,2,0,...,0,0,0,0,0,0,0,1,0,0


In [2]:
#look at the different categories present
print(social_marketing.columns)

Index(['Unnamed: 0', 'chatter', 'current_events', 'travel', 'photo_sharing',
       'uncategorized', 'tv_film', 'sports_fandom', 'politics', 'food',
       'family', 'home_and_garden', 'music', 'news', 'online_gaming',
       'shopping', 'health_nutrition', 'college_uni', 'sports_playing',
       'cooking', 'eco', 'computers', 'business', 'outdoors', 'crafts',
       'automotive', 'art', 'religion', 'beauty', 'parenting', 'dating',
       'school', 'personal_fitness', 'fashion', 'small_business', 'spam',
       'adult'],
      dtype='object')


In [3]:
#Create the column "major_category"
#This is the category out of the pre-specified scheme of categories which has the highest number of posts for the user.

social_marketing["major_category"] = social_marketing.iloc[:, 1:].idxmax(axis=1)

In [4]:
#select the required columns for the actual analysis
social_marketing_analysis_df=social_marketing.iloc[:,1:len(social_marketing.columns)-1]

In [5]:
#Apply PCA to reduce the number of features from n = 36 to n = 25
from sklearn.decomposition import PCA

pca = PCA(n_components=25)
vectors_pca = pca.fit_transform(social_marketing_analysis_df)  
vectors_pca.shape

(7882, 25)
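
One way to sanity-check the choice of 25 components is to look at the cumulative explained variance of the fitted PCA. The sketch below is a hedged illustration only: it runs on synthetic Poisson counts standing in for the real file, so the printed number says nothing about the actual data, and the names `synthetic_counts` and `pca_check` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 36 category columns (NOT the real study data).
rng = np.random.default_rng(0)
synthetic_counts = rng.poisson(lam=2.0, size=(500, 36)).astype(float)

# Fit PCA with the same number of components used in the analysis.
pca_check = PCA(n_components=25)
pca_check.fit(synthetic_counts)

# Running total of the share of variance captured by the first k components.
cumulative_variance = np.cumsum(pca_check.explained_variance_ratio_)
print(f"Variance retained by 25 components: {cumulative_variance[-1]:.3f}")
```

On the real data, the same check could be run on the fitted `pca` object above via `pca.explained_variance_ratio_`.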

In [6]:
#Apply t-SNE to reduce the number of features from n = 25 to n = 2
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, verbose=1, perplexity=30, n_iter=1000)
vectors_tsne = tsne.fit_transform(vectors_pca)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 7882 samples in 0.000s...
[t-SNE] Computed neighbors for 7882 samples in 0.264s...
[t-SNE] Computed conditional probabilities for sample 1000 / 7882
[t-SNE] Computed conditional probabilities for sample 2000 / 7882
[t-SNE] Computed conditional probabilities for sample 3000 / 7882
[t-SNE] Computed conditional probabilities for sample 4000 / 7882
[t-SNE] Computed conditional probabilities for sample 5000 / 7882
[t-SNE] Computed conditional probabilities for sample 6000 / 7882
[t-SNE] Computed conditional probabilities for sample 7000 / 7882
[t-SNE] Computed conditional probabilities for sample 7882 / 7882
[t-SNE] Mean sigma: 1.909754
[t-SNE] KL divergence after 250 iterations with early exaggeration: 87.460487
[t-SNE] KL divergence after 1000 iterations: 2.017832


In [7]:
#prepare the data in a dataframe for the result analysis
df_tsne = pd.DataFrame(vectors_tsne, columns=["Component 1", "Component 2"])
df_tsne['Label'] = social_marketing["major_category"].values

df_tsne.head()

Unnamed: 0,Component 1,Component 2,Label
0,64.879898,-15.852285,health_nutrition
1,-21.351665,-41.795658,sports_fandom
2,-57.188782,-20.969036,art
3,-30.017269,-6.246262,current_events
4,2.024731,42.449257,photo_sharing


***
# Results
***
***
**1)**   First, we plot the observations, now condensed to a 2D surface, with the *Major Category* as the label.

In [18]:
import plotly.express as px

#pass the dataframe first, then name the x, y and color columns
fig = px.scatter(df_tsne, x="Component 1", y="Component 2", color="Label", title="t-SNE visualization of Twitter Users")
fig.update_layout(autosize=False, width=1000, height=1000)
fig.show()

First, we see that **Chatter** as a category is *close to almost all of the other categories.* It therefore has no significant business value, as it does not by itself point to anything unique.
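
A simple complementary check is to see what share of users each label accounts for, since a very common catch-all label will naturally appear spread across the whole plot. The sketch below uses a short, invented list of labels (`toy_labels` is hypothetical); on the real data the same call could be applied to `df_tsne['Label']`.

```python
import pandas as pd

# Invented labels for eight hypothetical users (not taken from the study data).
toy_labels = pd.Series(
    ["chatter", "chatter", "politics", "cooking",
     "chatter", "photo_sharing", "politics", "chatter"],
    name="Label",
)

# normalize=True turns counts into fractions; the result is sorted in descending order.
label_share = toy_labels.value_counts(normalize=True)
print(label_share)
```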
***
Now, to examine the relationship between specific categories, we can study only the subset of the data related to them.
***
**2)** We explore the relationship between the categories **"college_uni"** and **"online_gaming"**.


In [19]:
custom_palette = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
df_tsne_plot=df_tsne[df_tsne['Label'].isin(["college_uni","online_gaming"])]
fig = px.scatter(df_tsne_plot, x="Component 1", y="Component 2", color="Label", color_discrete_sequence=custom_palette, title="t-SNE visualization of Twitter Users")
fig.update_layout(autosize=False, width=500, height=500)
fig.show()

**College_Uni** and **Online_Gaming** have *distributions close to each other*, i.e. users tweeting about online gaming might also be tweeting about university life or college issues. This relationship is intuitive and understandable to us.
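
Because t-SNE mainly preserves local neighbourhoods, a visual impression of closeness can be cross-checked against the raw post counts. The sketch below is only an illustration under stated assumptions: the data are simulated so that a hypothetical shared "student" interest drives both `college_uni` and `online_gaming`, while `cooking` is independent. None of these numbers come from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_users = 1000

# Hypothetical latent interest that feeds both columns (an assumption of this toy example).
student_interest = rng.poisson(1.5, n_users)

toy = pd.DataFrame(
    {
        "college_uni": student_interest + rng.poisson(0.5, n_users),
        "online_gaming": student_interest + rng.poisson(0.5, n_users),
        "cooking": rng.poisson(2.0, n_users),  # independent of the other two
    }
)

# Pearson correlation between per-user post counts.
pair_corr = toy["college_uni"].corr(toy["online_gaming"])
baseline_corr = toy["college_uni"].corr(toy["cooking"])
print(f"college_uni vs online_gaming: {pair_corr:.2f}")
print(f"college_uni vs cooking:       {baseline_corr:.2f}")
```

On the real data, the same comparison could be made with `social_marketing["college_uni"].corr(social_marketing["online_gaming"])`.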
***
**3)** We explore the relationship between the categories **"cooking"** and **"politics"**.

In [20]:
custom_palette = ['#2ca02c','#ff7f0e', '#1f77b4', '#d62728', '#9467bd']
df_tsne_plot=df_tsne[df_tsne['Label'].isin(["cooking","politics"])]
fig = px.scatter(df_tsne_plot, x="Component 1", y="Component 2", color="Label", color_discrete_sequence=custom_palette, title="t-SNE visualization of Twitter Users")
fig.update_layout(autosize=False, width=500, height=500)
fig.show()

**Politics** and **Cooking** have *distributions close to each other*, i.e. users tweeting about politics might also be tweeting about cooking. This relationship is not intuitive to us, as the two topics are not logically connected.
This makes the finding particularly interesting.
***
**4)** We explore the relationship between the categories **"politics"** and **"sports_fandom"**.

In [21]:
custom_palette = ['#9467bd' ,'#2ca02c','#ff7f0e', '#1f77b4', '#d62728']
df_tsne_plot=df_tsne[df_tsne['Label'].isin(["politics","sports_fandom"])]
fig = px.scatter(df_tsne_plot, x="Component 1", y="Component 2", color="Label", color_discrete_sequence=custom_palette, title="t-SNE visualization of Twitter Users")
fig.update_layout(autosize=False, width=500, height=500)
fig.show()

**Politics** and **Sports Fandom** have *distributions far from each other*, i.e. users tweeting about politics might not be tweeting about sports fandom at all.
***
Thus, from the point of view of an advertiser, it might not be effective to place Political Ads in the feeds of users who would be placed in the Category of "Sport Fandom"
***
***
**5)** We explore the relationship between the categories **"politics"** and **"photo_sharing"**.

In [23]:
custom_palette = ['#ff7f0e', '#d62728','#9467bd' ,'#2ca02c', '#1f77b4']
df_tsne_plot=df_tsne[df_tsne['Label'].isin(["politics","photo_sharing"])]
fig = px.scatter(df_tsne_plot, x="Component 1", y="Component 2", color="Label", color_discrete_sequence=custom_palette, title="t-SNE visualization of Twitter Users")
fig.update_layout(autosize=False, width=500, height=500)
fig.show()

**Politics** and **Photo Sharing** have *distributions close to each other*. Thus, it might be effective to market a photo-sharing product to users whose major category is "Politics".
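
The "close" and "far" judgements in this section are made by eye. One rough way to put a number on them is to compare the centroids of each label in the 2D embedding. The sketch below does this on a handful of invented points; `toy_embedding` and `centroid_distance` are hypothetical names, and the coordinates are not taken from the actual t-SNE output. Since t-SNE does not reliably preserve global distances, such numbers are best read as a rough guide only.

```python
import numpy as np
import pandas as pd

# Invented 2D coordinates with the same column layout as df_tsne (not real output).
toy_embedding = pd.DataFrame(
    {
        "Component 1": [0.0, 2.0, 1.0, 1.0, 10.0, 12.0],
        "Component 2": [0.0, 0.0, 1.0, 3.0, 0.0, 0.0],
        "Label": ["politics", "politics", "photo_sharing",
                  "photo_sharing", "sports_fandom", "sports_fandom"],
    }
)

# Mean position of each label in the embedding.
centroids = toy_embedding.groupby("Label")[["Component 1", "Component 2"]].mean()


def centroid_distance(label_a: str, label_b: str) -> float:
    """Euclidean distance between the centroids of two labels."""
    return float(np.linalg.norm(centroids.loc[label_a] - centroids.loc[label_b]))


print(centroid_distance("politics", "photo_sharing"))
print(centroid_distance("politics", "sports_fandom"))
```

On the real embedding, the same grouping could be applied directly to `df_tsne`.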

# Conclusion
***
Some of the observations from the analysis are as follows:
- ***Chatter*** as a category does not offer any unique information and is thus of no business value
- ***College*** and ***Online Gaming*** as categories are close to each other
- ***Cooking*** and ***Politics*** as categories are close to each other. This is an interesting insight, as it is not intuitive.
- ***Politics*** and ***Sports Fandom*** as categories are far away from each other
- ***Politics*** and ***Photo Sharing*** as categories are close to each other.