# Social Data from the Web

## Social Networking Communities

We have just discussed how clustering can be an effective tool to understand political behaviour. As an unsupervised learning technique it provides a new machine reading on party affiliations. Another popular application of clustering is detecting communities in social relationships. Next we go through an example and dataset from Brett Lantz’s excellent book on Machine Learning (Lantz, B. (2013). Packt Publishing Ltd.). The dataset is discussed from p. 279 onwards. It covers the relationships in a Social Networking Service (SNS). While this is a fairly early SNS dataset, it is freely available and offers the same kind of insights you can gain from more recent examples.

This section also introduces you to the intersection of digital marketing techniques and sociological studies of online networks.

Lantz explains that the dataset was compiled for sociological research on teenage identities at Notre Dame University. It represents a random sample of 30,000 US high school students who had profiles on a well-known SNS in 2006. At the time the data was collected, the SNS was a popular web destination for U.S. teenagers. Therefore, it is reasonable to assume that the profiles represent a fairly wide cross section of American adolescents in 2006. The data was sampled evenly across four high school graduation years (2006 through 2009) representing the senior, junior, second-year and freshman classes at the time of data collection. Then, the full texts of the SNS profiles were downloaded. Each teen's gender, age and number of SNS friends were recorded.

A text-mining tool was used to divide the remaining SNS page content into words. From the top 500 words appearing across all pages, 36 words were chosen to represent five categories of interests, namely extracurricular activities, fashion, religion, romance and antisocial behaviour. The 36 words include terms such as football, sexy, kissed, bible, shopping, death, drugs, etc. The final dataset indicates, for each person, how many times each word appeared in the person's SNS profile. 

First we load the relevant packages and the dataset. Run the cell below.

In [1]:
#Keep cell

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline


teens = pd.read_csv("data/snsdata.csv")

Print out the first couple of rows from the teens dataset. You know how ...

In [2]:
teens.head()

Unnamed: 0.1,Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,...,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs
0,1,2006,M,19.0,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,2006,F,19.0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,3,2006,M,18.0,69,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,2006,F,19.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,2006,,19.0,10,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1,1


The teens dataset is now loaded into your environment. Take a close look and make sure you understand how it is produced.

We can use the info() method to output some general information about the dataframe. Run `teens.info()`.

In [3]:
teens.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 41 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    30000 non-null  int64  
 1   gradyear      30000 non-null  int64  
 2   gender        27276 non-null  object 
 3   age           24914 non-null  float64
 4   friends       30000 non-null  int64  
 5   basketball    30000 non-null  int64  
 6   football      30000 non-null  int64  
 7   soccer        30000 non-null  int64  
 8   softball      30000 non-null  int64  
 9   volleyball    30000 non-null  int64  
 10  swimming      30000 non-null  int64  
 11  cheerleading  30000 non-null  int64  
 12  baseball      30000 non-null  int64  
 13  tennis        30000 non-null  int64  
 14  sports        30000 non-null  int64  
 15  cute          30000 non-null  int64  
 16  sex           30000 non-null  int64  
 17  sexy          30000 non-null  int64  
 18  hot           30000 non-nu

Part of this lesson is centered on the issue of looking into real-life data on digital society. We have mentioned earlier that a common problem is that observations/records are missing in such data, which is indicated by the NaN value in Python - as you might remember. 

In the info printout, you can also see that the non-null count is lower for those columns that contain NaN values. Gender is one of them.

Let's first watch a quick video on how to deal with dirty data in general.

<video width="90%" height="90%" controls src="img-videos/Session2.mp4" />

That was a lot of information. Let's go through this one step at a time with the teens data.

Print out the absolute non-null value counts for gender as well as the relative ones with:
```
print(teens['gender'].value_counts())
print(teens['gender'].value_counts(normalize=True))
```

In [4]:
print(teens['gender'].value_counts())
print(teens['gender'].value_counts(normalize=True))

F    22054
M     5222
Name: gender, dtype: int64
F    0.80855
M    0.19145
Name: gender, dtype: float64


With dropna set to False in value_counts, we can also see NaN index values. Try that ...

In [5]:
teens['gender'].value_counts(dropna=False)

F      22054
M       5222
NaN     2724
Name: gender, dtype: int64

But missing values are not our only problem. At least as common are misreported observations in real-life data. As an example, let’s look at the distribution of the teens' ages. You can do this in several ways but you should always print out maximum and minimum values. Run: `teens['age'].describe()`.

In [6]:
teens['age'].describe()

count    24914.000000
mean        17.985791
std          7.865982
min          3.000000
25%         16.000000
50%         17.000000
75%         18.000000
max        107.000000
Name: age, dtype: float64

There are quite a few strange records here. The minimum age is 3 and the maximum age is over 100! These cannot be considered teenagers.

As a rule of thumb, let’s assume teenagers are between 13 and 19 years old. Let’s mark all other teens' age entries as invalid. We say that invalid entries should have a NaN value. Set this, please, by typing in `teens.loc[(teens['age'] < 13) | (teens['age'] >= 20), 'age'] = np.nan`.

In [7]:
teens.loc[(teens['age'] < 13) | (teens['age'] >= 20), 'age'] = np.nan

The next step in our data cleaning process is to replace NaN values. Of course, we could simply remove all rows/observations, for which we have null entries. We did this effectively in the senate example. But then there was only one row that contained null values. In the SNS example, we would lose too many rows with such a brute-force approach. So, let’s try and fill the null values with estimated values. 

Let’s start with gender and handle its null values by creating new indicator columns.

To this end, we first create a new column to record all the female teenagers. Create a new column 'female' that is set to 1 if the teenager is female and 0 otherwise. Run `teens.loc[(teens['gender'] == 'F') & (teens['gender'].notna()), 'female'] = 1` to set the female column to 1 for females. notna() selects all rows that are not NaN.

In [8]:
teens.loc[(teens['gender'] == 'F') & (teens['gender'].notna()), 'female'] = 1


Can you set the female column to 0 for males? You just need to change one letter ...

In [9]:
teens.loc[(teens['gender'] == 'M') & (teens['gender'].notna()), 'female'] = 0

Next we will create another column for the null values in gender, which we call no_gender. Set this to 1 if there is no gender recorded and to 0 otherwise.

By the way, this process is called dummy-coding, and it is typical of community analysis. A dummy variable is a numerical variable used in the analysis to represent subgroups – in our case males, females and others. In research design, a dummy variable is often used to distinguish different groups to address them differently. By creating a separate column per gender entry, we can compute clusters for separate gender communities.

Check out dummy-coding on the web. Can you find easier ways to do this in Pandas?

Run 
```
teens.loc[teens['gender'].notna(), 'no_gender'] = 0
teens.loc[teens['gender'].isna(), 'no_gender'] = 1
```

In [10]:
teens.loc[teens['gender'].notna(), 'no_gender'] = 0
teens.loc[teens['gender'].isna(), 'no_gender'] = 1

After this, we have the original column, a new column called female, which records whether the teen is female (1), male (0) or of unrecorded gender (NaN), and a new column no_gender with information about whether the gender value is missing. Using this column we could, for instance, check with clustering whether certain communities have a tendency not to record their gender values. How?
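One possible shortcut for the dummy-coding above is Pandas' `get_dummies` function. The sketch below runs on a small made-up data frame (`toy` is an illustration, not the lesson data), and the new column names are assigned by hand to mirror ours. Note one difference from the cells above: here female is 0, not NaN, when the gender is missing.

```python
import numpy as np
import pandas as pd

# Made-up example data: two females, one male, one missing gender
toy = pd.DataFrame({'gender': ['M', 'F', np.nan, 'F']})

# dummy_na=True adds an indicator column for missing values
dummies = pd.get_dummies(toy['gender'], dummy_na=True, dtype=int)

# The columns come back as F, M and NaN; rename them for readability
dummies.columns = ['female', 'male', 'no_gender']
print(dummies)
```

Whether you prefer this or the explicit loc assignments is a matter of taste; the loc version makes each step visible, which is helpful while learning.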

Check out the changes with teens.head(). You have to scroll all the way to the right to find the new columns.

In [11]:
teens.head(20)

Unnamed: 0.1,Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,...,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs,female,no_gender
0,1,2006,M,19.0,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,0.0
1,2,2006,F,19.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1.0,0.0
2,3,2006,M,18.0,69,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0.0,0.0
3,4,2006,F,19.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1.0,0.0
4,5,2006,,19.0,10,0,0,0,0,0,...,2,0,0,0,0,0,1,1,,1.0
5,6,2006,F,,142,0,0,0,0,0,...,1,0,0,0,0,0,1,0,1.0,0.0
6,7,2006,F,19.0,72,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1.0,0.0
7,8,2006,M,18.0,17,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0.0,0.0
8,9,2006,F,19.0,52,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1.0,0.0
9,10,2006,F,19.0,39,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1.0,0.0


It's now very easy to calculate the number of teenagers for whom we do not have gender entries. How? Remember sum()?

In [12]:
teens['no_gender'].sum()

2724.0

Did you find that there are 2724?

The age column is next. 

Can you find the average age and take care that null values are discounted? Tip: run Pandas' mean and set skipna = True: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html

In [13]:
teens['age'].mean(skipna = True)

17.223233030090974

What would happen if you set skipna to False?

A good strategy to overwrite missing age values would be to use the average age value and assign it to all of the missing ones. This process is called mean-imputation and is employed frequently. Pandas has some real strengths here. Check out https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html.

Pandas makes your life very easy with its fillna function. Run the following cell.

In [14]:
#Keep cell
teens['age'].fillna(teens['age'].mean()).quantile([.25, .5, .75])

0.25    17.000000
0.50    17.223233
0.75    18.000000
Name: age, dtype: float64
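To see what skipna and fillna do in isolation, here is a minimal sketch on a made-up series of four ages (the values are invented for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry
ages = pd.Series([16.0, 18.0, np.nan, 17.0])

# With skipna=True (the default) the NaN is ignored
print(ages.mean(skipna=True))

# With skipna=False a single NaN makes the whole mean NaN
print(ages.mean(skipna=False))

# Mean-imputation: replace the missing entry with the mean of the others
filled = ages.fillna(ages.mean())
print(filled.tolist())
```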

Let's further improve this with some good old-fashioned human intelligence.

We feel confident that we can do better than just the mean, because we know the graduation year, too. This is the year our teens are supposed to graduate. It seems a reasonable assumption that those teenagers with an earlier graduation year should be older than those for whom graduation is further away. 

We can easily find this out by running the mean function for each graduation year group separately. Type in `teens[['gradyear', 'age']].groupby(['gradyear']).mean()`. 

In [15]:
teens[['gradyear', 'age']].groupby(['gradyear']).mean()

Unnamed: 0_level_0,age
gradyear,Unnamed: 1_level_1
2006,18.599801
2007,17.685516
2008,16.762584
2009,15.82347


Let's take a moment and look at groupby() as it is essential knowledge for working with Pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html. It is at the heart of the split-apply-combine paradigm that will keep us busy for the rest of this session: https://pandas.pydata.org/docs/user_guide/groupby.html. Take a close read and you will see that we will cover all its elements throughout the session today. Groupby allows you to 'split' data into distinct groups. This is often done with the intention to then 'apply' a function to each group, like mean() in our case. This works amazingly well but requires lots of practice in my experience. So, let's move on.
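The split-apply-combine idea can be sketched on a tiny made-up data frame (`toy` and its values are assumptions for illustration). The first version splits and applies by hand, the second lets groupby do all three steps.

```python
import pandas as pd

# Made-up data: two teens per graduation year
toy = pd.DataFrame({
    'gradyear': [2006, 2006, 2007, 2007],
    'age': [19.0, 18.0, 18.0, 17.0],
})

# Split and apply by hand: iterate over the groups and take each mean
by_hand = {year: group['age'].mean() for year, group in toy.groupby('gradyear')}

# groupby splits, applies mean() and combines the results in one expression
combined = toy.groupby('gradyear')['age'].mean()

print(by_hand)
print(combined)
```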

According to our last output, our suspicion has proven right. There is a clear difference in the average ages depending on the graduation year. Let’s use this knowledge and update missing values in the age column depending on the graduation year. To this end, you actually have to do a lot of Pandas labour, which demonstrates that 80% of the work of a data scientist lies in working with data. But I am sure you know this by now.

You can, e.g., proceed with the following strategy:
First create a temporary dataset with the results from the above groupby call. Then merge this new dataset with teens and replace the null values of teens.age with the ones from the temporary dataset.

Create the temporary dataset first with `ave_age = teens[['gradyear', 'age']].groupby(['gradyear'], as_index=False).mean()`. Also print out ave_age. It should be a data frame of four rows, one per graduation year.

Wondering about as_index=False? Check out https://pandas.pydata.org/docs/user_guide/groupby.html. We simply do not want to create a new index from the groupby argument gradyear.

In [16]:
ave_age = teens[['gradyear', 'age']].groupby(['gradyear'], as_index=False).mean()
ave_age

Unnamed: 0,gradyear,age
0,2006,18.599801
1,2007,17.685516
2,2008,16.762584
3,2009,15.82347


Update the age column of teens but make sure that in the end you have not added additional columns. First you need to merge teens and ave_age on gradyear. Run `teens = pd.merge(teens, ave_age, on=['gradyear'])`. Also print out the first few rows of teens.

Merge is another powerful command to learn about. It's part of the Merge, join, concatenate and compare group of commands, some of which we have already met: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html. If you happen to know database languages, you will know everything about it. Merge() combines two data frames on common columns - in our case gradyear.

In [17]:
teens = pd.merge(teens, ave_age, on=['gradyear'])
teens.head()

Unnamed: 0.1,Unnamed: 0,gradyear,gender,age_x,friends,basketball,football,soccer,softball,volleyball,...,clothes,hollister,abercrombie,die,death,drunk,drugs,female,no_gender,age_y
0,1,2006,M,19.0,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0.0,18.599801
1,2,2006,F,19.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1.0,0.0,18.599801
2,3,2006,M,18.0,69,0,1,0,0,0,...,0,0,0,0,1,0,0,0.0,0.0,18.599801
3,4,2006,F,19.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1.0,0.0,18.599801
4,5,2006,,19.0,10,0,0,0,0,0,...,0,0,0,0,0,1,1,,1.0,18.599801


With some scrolling, you should now see two age columns. age_x is the original one from teens, while age_y is the estimated value based on the gradyear. Now, we want to replace the age_x (the original value) with age_y if age_x is NaN. Run `teens.loc[(teens['age_x'].isna()), 'age_x'] = teens['age_y']`.

In [18]:
teens.loc[(teens['age_x'].isna()), 'age_x'] = teens['age_y']

Now, we only need to do some cleaning up. Give age_x its old name age back and drop age_y, which we don't need anymore. Run the cell below.

In [19]:
#Keep cell
teens.rename(columns={'age_x': 'age'}, inplace=True)
teens.drop('age_y', axis=1, inplace=True)

Use `teens.info()` to see that the age column does not contain NaNs anymore.

In [20]:
teens.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Data columns (total 43 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    30000 non-null  int64  
 1   gradyear      30000 non-null  int64  
 2   gender        27276 non-null  object 
 3   age           30000 non-null  float64
 4   friends       30000 non-null  int64  
 5   basketball    30000 non-null  int64  
 6   football      30000 non-null  int64  
 7   soccer        30000 non-null  int64  
 8   softball      30000 non-null  int64  
 9   volleyball    30000 non-null  int64  
 10  swimming      30000 non-null  int64  
 11  cheerleading  30000 non-null  int64  
 12  baseball      30000 non-null  int64  
 13  tennis        30000 non-null  int64  
 14  sports        30000 non-null  int64  
 15  cute          30000 non-null  int64  
 16  sex           30000 non-null  int64  
 17  sexy          30000 non-null  int64  
 18  hot           30000 non-nu

Check out the new average age with `teens['age'].mean()`.

In [21]:
teens['age'].mean()

17.21784283910326

This is all quite advanced stuff but as long as you remember the kind of steps we have taken you should be able to impute one column's missing values by using another column as a reference. In our case, we use our knowledge that age is dependent on gradyear to find the missing values. Please, take some time to review the steps.
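For review, the same group-wise imputation can also be written more compactly with groupby and transform. This is an alternative sketch, not the approach used in the cells above, and it runs on a made-up data frame (`toy` and the column age_filled are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up data: two graduation years, each with one missing age
toy = pd.DataFrame({
    'gradyear': [2006, 2006, 2006, 2009, 2009, 2009],
    'age': [19.0, 18.0, np.nan, 16.0, np.nan, 15.0],
})

# transform('mean') returns one value per row: the mean age of that row's group
group_mean = toy.groupby('gradyear')['age'].transform('mean')

# fillna only touches the rows where age is missing
toy['age_filled'] = toy['age'].fillna(group_mean)
print(toy)
```

Because transform keeps the original row order and length, no merge, rename or drop is needed afterwards.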

Let’s take a look at the resulting age column with `teens['age'].describe()`.

In [22]:
teens['age'].describe()

count    30000.000000
mean        17.217843
std          1.153788
min         13.000000
25%         16.000000
50%         17.000000
75%         18.000000
max         19.000000
Name: age, dtype: float64

This looks much better. We have now learned how to delete missing values completely or impute them using background knowledge.

Now that we have dealt with the missing records, I think we are ready to cluster again. We will use our trusted k-means without actually referring to either age or gender. Sorry! But it was good that you learned how to deal with missing values and we will use them later.

Just like in the US Senate example, we first need to understand what we are trying to cluster. In the Senate example, we clustered voting behaviour. Now, it will be interests, which we can get from the keyword columns of the teens data frame, starting at position 5. This time we select them by number as there is no clever way of selecting them by expression as for the Senate data.

Please create the interests dataframe with `interests = teens.iloc[:,5:40]`. Also, print out the first few rows of interests. Note that iloc excludes the end position, so 5:40 selects positions 5 to 39, that is the 35 keyword columns from basketball to drunk. The last keyword, drugs, sits at position 40 and is left out; use `teens.iloc[:,5:41]` if you want to include it.

In [23]:
interests = teens.iloc[:,5:40]
interests.head()

Unnamed: 0,basketball,football,soccer,softball,volleyball,swimming,cheerleading,baseball,tennis,sports,...,dress,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,4,0,1,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,1


We did not mention this before, because it was not necessary but k-means is very sensitive to input of varying size, length, etc. It was not necessary to focus on this in the previous example, because all the voting behaviour was in the range of 0 to 1. Now, we have interests of very different ranges. The interests are simply based on how many times a keyword appears in teenagers' social networking contributions.

Since k-means is based on calculating distances between data points and their centroids, it will be strongly influenced by the magnitudes of the variables we cluster. Think about it for a moment! Just imagine one column having values running from 1 to 10 and another from 1 to 1000. How could we compare distances between them?

We therefore need to scale the values so that they are all on a comparable scale. To this end, SciPy has the zscore function in scipy.stats, which centres values around their mean and divides them by their standard deviation. Using the apply function, we can tell Pandas to standardise all interests columns. apply is an alternative way to loop over the columns of a data frame and apply a function to each: https://www.datacamp.com/community/tutorials/pandas-apply.

Run `from scipy.stats import zscore` to import zscore, which is very popular in data analysis for standardisation: https://www.statisticshowto.com/probability-and-statistics/z-score/

In [24]:
from scipy.stats import zscore

Apply zscore and assign the results to a new dataframe interests_z with `interests_z = interests.apply(zscore)`. Finally, print out the first few rows.

In [25]:
interests_z = interests.apply(zscore)
interests_z.head()

Unnamed: 0,basketball,football,soccer,softball,volleyball,swimming,cheerleading,baseball,tennis,sports,...,dress,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk
0,-0.332217,-0.357697,-0.242874,-0.217928,-0.22367,-0.259971,-0.207327,-0.201131,-0.168939,-0.297123,...,-0.246906,-0.050937,-0.369915,-0.487314,-0.314198,-0.201476,-0.183032,-0.294793,-0.26153,-0.220403
1,-0.332217,1.060049,-0.242874,-0.217928,-0.22367,-0.259971,-0.207327,-0.201131,-0.168939,-0.297123,...,8.653277,-0.050937,1.067392,-0.487314,-0.314198,-0.201476,-0.183032,-0.294793,-0.26153,-0.220403
2,-0.332217,1.060049,-0.242874,-0.217928,-0.22367,-0.259971,-0.207327,-0.201131,-0.168939,-0.297123,...,-0.246906,-0.050937,-0.369915,-0.487314,-0.314198,-0.201476,-0.183032,-0.294793,2.027908,-0.220403
3,-0.332217,-0.357697,-0.242874,-0.217928,-0.22367,-0.259971,-0.207327,-0.201131,-0.168939,-0.297123,...,-0.246906,-0.050937,-0.369915,-0.487314,-0.314198,-0.201476,-0.183032,-0.294793,-0.26153,-0.220403
4,-0.332217,-0.357697,-0.242874,-0.217928,-0.22367,-0.259971,-0.207327,-0.201131,-0.168939,-0.297123,...,-0.246906,-0.050937,-0.369915,2.273673,-0.314198,-0.201476,-0.183032,-0.294793,-0.26153,2.285122


Each column is now standardised: it is centred on its mean, and each value tells you how many standard deviations it lies above or below that mean.
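To convince yourself of what zscore computes, the following sketch compares it with the formula (value - mean) / standard deviation on two made-up columns of very different magnitude (`toy` and its values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Made-up columns on very different scales
toy = pd.DataFrame({
    'small': [1, 2, 3, 4, 5],
    'large': [100, 300, 500, 700, 900],
})

# Standardise column by column, as we did for interests
toy_z = toy.apply(zscore)

# The same calculation by hand; zscore uses the population
# standard deviation (ddof=0) by default
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(toy_z)
print(np.allclose(toy_z, manual))
```

After standardisation the two columns are identical, although their raw values differ by two orders of magnitude, so neither dominates a distance calculation.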

### Clustering

Now we cluster again and start by importing KMeans. Run the cell below.

In [26]:
#Keep cell

from sklearn.cluster import KMeans

We decide 5 clusters is enough and assign k = 5. Run the cell below.

In [27]:
#Keep cell

k = 5

Now we are ready to cluster. Create and fit teen_clusters the way you know. It is just a copy and paste job from before.

In [28]:
teen_clusters = KMeans(n_clusters  = k) 
teen_clusters.fit(interests_z)

KMeans(n_clusters=5)

Let’s investigate the size of the clusters with .labels_ and np.bincount.

In [29]:
np.bincount(teen_clusters.labels_)

array([21555,   606,  6065,   903,   871])

I have noticed very different results from one k-means run to the next, because the algorithm starts from randomly chosen centroids. I suggest rerunning k-means a couple of times until you see a distribution that looks ok. You especially want to avoid clusters of only one item. This would be the time to determine the optimal k with the elbow method, but we don't want to do that right now and move on.
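If you want repeatable results, KMeans accepts a random_state argument, and its inertia_ attribute (the sum of squared distances to the nearest centroid) is what the elbow method inspects. The sketch below is only an illustration on synthetic data: the three blobs, the seed values and the range of k are assumptions, not part of the teens analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Fit k-means for several values of k and record the inertia
inertias = []
for n in range(1, 7):
    model = KMeans(n_clusters=n, n_init=10, random_state=42).fit(data)
    inertias.append(model.inertia_)

# The 'elbow' is the k after which the inertia stops dropping sharply
for n, inertia in zip(range(1, 7), inertias):
    print(n, round(inertia, 1))
```

For this synthetic data the inertia should fall steeply up to k = 3 and flatten afterwards. With real data such as interests_z the elbow is usually much less pronounced.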

We can also look at the centroids/centres with teen_clusters.cluster_centers_. You learned earlier how to pretty-print this in a data frame. Run:
```
interests_centroids = pd.DataFrame(teen_clusters.cluster_centers_, columns=interests_z.columns)
interests_centroids
```

In [30]:
interests_centroids = pd.DataFrame(teen_clusters.cluster_centers_, columns=interests_z.columns)
interests_centroids

Unnamed: 0,basketball,football,soccer,softball,volleyball,swimming,cheerleading,baseball,tennis,sports,...,dress,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk
0,-0.1657,-0.163319,-0.0895,-0.114934,-0.11767,-0.107491,-0.115132,-0.108819,-0.050377,-0.130277,...,-0.143154,-0.029007,-0.189193,-0.232519,-0.189328,-0.156387,-0.148178,-0.092341,-0.082756,-0.084042
1,-0.098441,0.058736,-0.100744,-0.026073,-0.063793,0.024111,-0.104658,-0.122057,0.038548,-0.097453,...,0.083546,-0.012705,-0.071069,-0.029428,0.007014,-0.168166,-0.141711,0.009078,0.05204,-0.083963
2,0.507589,0.475845,0.287252,0.367224,0.37801,0.303391,0.335872,0.341953,0.135555,0.32814,...,0.389031,0.034689,0.507584,0.669051,0.382935,-0.052228,-0.075133,0.060193,0.122999,0.031513
3,0.456345,0.441453,0.173671,0.222229,0.129358,0.266983,0.188853,0.352879,0.137397,0.873599,...,0.598266,0.407474,0.559639,0.318356,1.384543,0.197711,0.292339,1.740926,1.001084,1.794006
4,0.160015,0.228283,0.103857,0.073874,0.188975,0.253031,0.386454,0.029935,0.133105,0.100143,...,0.154164,0.062557,0.615242,0.783818,0.577309,4.145632,3.985539,0.054507,0.116976,0.058628


A simple way to characterise the clusters is to find, for each interest, the cluster with the maximum centroid value. Try it by using the idxmax() function from Pandas: `interests_centroids.idxmax()`. Check out its documentation.

In [31]:
interests_centroids.idxmax()

basketball      2
football        2
soccer          2
softball        2
volleyball      2
swimming        2
cheerleading    4
baseball        3
tennis          3
sports          3
cute            3
sex             3
sexy            3
hot             4
kissed          3
dance           3
band            1
marching        1
music           3
rock            3
god             3
church          2
jesus           2
bible           2
hair            3
dress           3
blonde          3
mall            4
shopping        4
clothes         3
hollister       4
abercrombie     4
die             3
death           3
drunk           3
dtype: int64
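idxmax() reads the centroid table column by column. You can also read it row by row and ask which interests score highest within each cluster. Here is a minimal sketch on made-up centroid values (the data frame `centroids` and its numbers are invented for illustration).

```python
import pandas as pd

# Made-up centroids: one row per cluster, one column per interest
centroids = pd.DataFrame({
    'football': [0.5, -0.1, 0.0],
    'bible': [-0.2, 0.9, 0.1],
    'mall': [0.1, 0.0, 1.2],
    'soccer': [0.4, -0.3, 0.2],
})

# Column view: for each interest, the cluster with the largest centroid value
print(centroids.idxmax())

# Row view: for each cluster, its two highest-scoring interests
top = {cluster: row.nlargest(2).index.tolist() for cluster, row in centroids.iterrows()}
print(top)
```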

Hopefully, your results look similar to the table from Lantz (2013) on p. 288: 

![title](img-videos/teen-clusters.jpg)

Do the names of the clusters make sense to you? Do you remember all those teenage movies you watched?

Next, let’s continue with another type of analysis. Let’s first assign each teen to a cluster, as we did before in the voting example for the senators. Please, add a column called 'cluster' to the teen data frame with `teens['cluster'] = np.array(teen_clusters.labels_)`.

In [32]:
teens['cluster'] = np.array(teen_clusters.labels_)

Let's take a look at the teen data frame with head(). All the way to the right, you can find the cluster assignment.

In [33]:
teens.head()

Unnamed: 0.1,Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,...,clothes,hollister,abercrombie,die,death,drunk,drugs,female,no_gender,cluster
0,1,2006,M,19.0,7,0,0,0,0,0,...,0,0,0,0,0,0,0,0.0,0.0,0
1,2,2006,F,19.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1.0,0.0,2
2,3,2006,M,18.0,69,0,1,0,0,0,...,0,0,0,0,1,0,0,0.0,0.0,0
3,4,2006,F,19.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1.0,0.0,0
4,5,2006,,19.0,10,0,0,0,0,0,...,0,0,0,0,0,1,1,,1.0,3


Let's print out the first 5 teens and only the columns 'cluster', 'gender', 'age' and 'friends'. I hope you remember how this works. How do we select the first 5 rows? How do we select the columns? Tip: loc is the indexer you are looking for, and remember that its slices include the end label.

In [34]:
teens.loc[:4, ['cluster', 'gender', 'age', 'friends']]

Unnamed: 0,cluster,gender,age,friends
0,0,M,19.0,7
1,2,F,19.0,0
2,0,M,18.0,69
3,0,F,19.0,0
4,3,,19.0,10
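The difference between loc and iloc slicing is easy to forget, so here is a short sketch on a made-up data frame (`toy` is an assumption for illustration). loc slices by label and includes the end label, while iloc slices by position and excludes the end position. The same rule explains why `teens.iloc[:,5:40]` covers positions 5 to 39.

```python
import pandas as pd

# Made-up data frame with the default index 0 to 9
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20), 'c': range(20, 30)})

# loc: labels 0 to 4 inclusive, columns chosen by name
by_label = toy.loc[:4, ['a', 'c']]

# iloc: positions 0 to 3, column positions 0 and 1
by_position = toy.iloc[:4, 0:2]

print(by_label.shape)
print(by_position.shape)
```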


Since we have learned earlier how to group by a column, let’s aggregate the teens' features using the clusters.

Print out the average ages per cluster. Do you remember how this works? Tip: Replace gradyear with cluster and you are ready to go.

In [35]:
teens[['age', 'cluster']].groupby(['cluster'], as_index=False).mean()

Unnamed: 0,cluster,age
0,0,17.275689
1,1,17.377248
2,2,17.068167
3,3,17.087143
4,4,16.853126


The clusters do not differ in terms of ages very much. There is no immediate relation between age and interest clusters. Now, let’s look at the female contribution to each cluster. How? Which column do you have to use instead of age?

In [36]:
teens[['female', 'cluster']].groupby(['cluster'], as_index=False).mean()

Unnamed: 0,cluster,female
0,0,0.777703
1,1,0.772329
2,2,0.893979
3,3,0.87013
4,4,0.906716


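Overall, about 74 per cent of all SNS profiles are recorded as female, or about 81 per cent of those with a recorded gender. The cluster means above correspond to the second figure, because mean() skips the NaN values in the female column. That’s why females contribute so much to each cluster. Can you see the cluster that has the highest share of female users? Do you know why? Look back to the earlier analysis of the interests linked to the clusters ...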

You can check for the average number of friends per cluster now. Just define the target of the aggregation per cluster as friends instead of female or age in the expressions above.

In [37]:
teens[['friends', 'cluster']].groupby(['cluster'], as_index=False).mean()

Unnamed: 0,cluster,friends
0,0,27.63614
1,1,32.457096
2,2,37.134872
3,3,31.831672
4,4,41.390356


Here, the differences are stronger. Remember that the number of friends was not part of the clustering input, so this suggests that teens' interests are related to how well connected they are. That’s the nature of a social network, I guess.
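The three per-cluster summaries above can also be computed in a single groupby call. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration) and additionally records the cluster sizes.

```python
import numpy as np
import pandas as pd

# Made-up data: five teens in two clusters, one with missing gender
toy = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1],
    'age': [17.0, 18.0, 19.0, 15.0, 16.0],
    'female': [1.0, np.nan, 0.0, 1.0, 1.0],
    'friends': [10, 20, 30, 40, 60],
})

# One mean per cluster for several columns at once; NaN values are skipped
summary = toy.groupby('cluster')[['age', 'female', 'friends']].mean()

# Add the number of teens per cluster for context
summary['size'] = toy.groupby('cluster').size()
print(summary)
```

Note that the female share of cluster 0 is 0.5 rather than one third, because the teen with missing gender does not count towards the mean.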

We have now completed our exercises on clustering and understanding political and social communities. For today, we have just one more important question to answer. How do we get access to the kind of data we worked on today? The teens dataset stemmed from a research project in sociology published online, while the US Senate voting behaviour was downloaded from US government websites. 

## Web Scraping

The web has become a unique source of data for social analysis. Munzert et al. (2014) in their book on 'Automated data collection (...). A practical guide to web scraping and text mining' (John Wiley & Sons) emphasize in the Introduction that 'the rapid growth of the World Wide Web over the past two decades tremendously changed the way we share, collect, and publish data. Firms, public institutions, and private users provide every imaginable type of information and new channels of communication generate vast amounts of data on human behavior. What was once a fundamental problem for the social sciences — the scarcity and inaccessibility of observations — is quickly turning into an abundance of data. This turn of events does not come without problems. (…), traditional techniques for collecting and analyzing data may no longer suffice to overcome the tangled masses of data.' (p. XV). 

In short, we can find lots of data on the web. A big problem with web data is, however, that it is often inconsistent and heterogeneous. To get access to it, one often has to visit multiple web sites and assemble their data together. Finally, the data is generally published without reuse in mind, which implies that the data can be of low quality. That said, the web is so vast that it still provides an often overwhelming source of exciting data. 

Let's take a look at how we can access web data in general by scraping web sites.

<video width="90%" height="90%" controls src="img-videos/Session3.mp4" />

Returning to our first example of political communities, we will scrape data on the current composition of the US Senate from Wikipedia. 

This can be a complex task and requires additional libraries. But in this case, we can rely on Pandas directly with its read_html function that does all the hard work for you. Check it at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html.

All the content on the web is presented to us in a language called HyperText Markup Language (HTML; https://en.wikipedia.org/wiki/HTML). HTML is of course a way of presenting content on the web in a universal way. It also contains so-called hyperlinks that let you jump from web content to web content. 

If you are interested in the further details of HTML, why not take some time now to visit the excellent http://www.w3schools.com/html/, which contains a lot of practical exercises to learn everything about HTML and other web technologies. 

Each document on the web is identified by a URL. We set the url to the wikipedia page of current US senators and run the below cell.

In [38]:
#Run the code below

url = 'https://en.wikipedia.org/wiki/Current_members_of_the_United_States_Senate'

With the read_html function of Pandas, we can read in the web content behind the URL. However, if you check the documentation, you will see that you need to provide Pandas with further details.

Unfortunately, web content in HTML format is not very structured and often simply chaotic. We would like to download only the table of the page of current US Senators and need to find a so-called 'match' for read_html to choose that table from the HTML document. 

Fortunately for us, there are many existing strategies to determine exactly the HTML element we would like to select. I looked for a specific name in the table that is not repeated in the rest of the wiki page. Run: `senator_wiki = pd.read_html('https://en.wikipedia.org/wiki/List_of_current_United_States_senators', match = 'Richard Shelby')`.

In [39]:
senator_wiki = pd.read_html('https://en.wikipedia.org/wiki/List_of_current_United_States_senators', match = 'Richard Shelby')
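
To see what match does without depending on a live web page, here is a minimal sketch on a small hand-written HTML string. The HTML snippet and the variable names are made up for illustration, and the sketch assumes that an HTML parser such as lxml is installed, which read_html needs in any case.

```python
from io import StringIO
import pandas as pd

# Toy HTML document with two tables (hand-written, not taken from Wikipedia)
html = """
<html><body>
<table>
  <tr><th>Chamber</th><th>Seats</th></tr>
  <tr><td>Senate</td><td>100</td></tr>
  <tr><td>House</td><td>435</td></tr>
</table>
<table>
  <tr><th>State</th><th>Senator</th><th>Term up</th></tr>
  <tr><td>Alabama</td><td>Richard Shelby</td><td>2022</td></tr>
  <tr><td>Alaska</td><td>Lisa Murkowski</td><td>2022</td></tr>
</table>
</body></html>
"""

# Without match, read_html returns every table it finds
all_tables = pd.read_html(StringIO(html))

# With match, only tables that contain the given text are returned
matched = pd.read_html(StringIO(html), match='Richard Shelby')

print(len(all_tables), len(matched))
print(matched[0])
```

read_html always returns a list of dataframes, even if only one table matches, which is why we pick the first element with [0] in the next cell.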

If it all worked as it should, run the cell below to create our senators dataframe.

In [40]:
#Keep cell
senators = senator_wiki[0]
senators.head()

Unnamed: 0,State,Portrait,Senator,Party,Party.1,Born,Occupation(s),Previous electiveoffice(s),Education,Assumed office,Term up,Residence
0,Alabama,,Richard Shelby,,Republican[d],(age 87),Lawyer,U.S. HouseAlabama Senate,"University of Alabama (BA, LLB) Birmingham Sch...","January 3, 1987",2022,Tuscaloosa[2]
1,Alabama,,Tommy Tuberville,,Republican,(age 67),"College football coachPartner, investment mana...",,Southern Arkansas University (BS),"January 3, 2021",2026,Auburn
2,Alaska,,Lisa Murkowski,,Republican,(age 64),Lawyer,Alaska House of Representatives,Georgetown University (AB) Willamette Universi...,"December 20, 2002[e]",2022,Girdwood[3]
3,Alaska,,Dan Sullivan,,Republican,(age 57),U.S. Marine Corps officerLawyerAssistant Secre...,Alaska Attorney General,Harvard University (AB) Georgetown University ...,"January 3, 2015",2026,Anchorage[4]
4,Arizona,,Kyrsten Sinema,,Democratic,(age 45),Social workerPolitical activistLawyerCollege l...,U.S. HouseArizona SenateArizona House of Repre...,Brigham Young University (BA) Arizona State Un...,"January 3, 2019",2024,Phoenix[5]


Fortunately, the data is already of fairly good quality, but we still need to clean the data a bit.

Let's do some basic cleaning, where we ignore the strange textual errors and focus on the various columns that require direct attention. Please:

(1) Create a 'Party' column from whatever name read_html has given that column. In my case, it was called Party.1.

In [41]:
senators['Party'] = senators['Party.1']

(2) Make sure that the 'Term up' is of type int.

Transform the column 'Term up' into an integer column with `senators['Term up'] = senators['Term up'].astype(int)`.

In [42]:
senators['Term up'] = senators['Term up'].astype(int)

(3) Clean the column 'Born' to only contain the numerical age and rename it to 'Age'.

As you can see, the column Born contains the age of the senator, which we would like to extract from the string. As far as we can see, the age is given by the only two digits in a string that otherwise consists of letters and brackets. So, in (age 87) it is 87. To extract the 87, we can use regular expressions with Pandas. Take a look at https://www.dataquest.io/blog/regular-expressions-data-scientists/. We need the str.extract function https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html, which is very powerful. Run `senators['Age'] = senators['Born'].str.extract(r'.*(\d\d)')`.

In this expression, the r prefix marks a raw string, so that Python passes the backslashes on to the regular expression engine unchanged. The pattern first consumes as many characters as possible (.*) and then captures two digits (\d\d) in the parentheses, which extract returns. Note that the result is still a string; add .astype(int) if you need a numerical column. Regular expressions require some practice and trial and error in my experience.

In [43]:
senators['Age'] = senators['Born'].str.extract(r'.*(\d\d)')
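
If you would like to experiment with the pattern first, the following sketch applies it to a few hand-typed strings instead of the full dataframe. The toy series and the alternative pattern `age (\d+)` are my own illustration and not part of the original exercise. The alternative captures however many digits follow the word age, so it would also cope with a three-digit age.

```python
import pandas as pd

# Hand-typed examples in the same format as the 'Born' column
born = pd.Series(['(age 87)', '(age 67)', '(age 101)'])

# Pattern from the exercise: greedy .* followed by exactly two digits,
# so it always captures the LAST two digits in the string
last_two = born.str.extract(r'.*(\d\d)', expand=False)

# Alternative (assumption): capture all digits that follow 'age '
all_digits = born.str.extract(r'age (\d+)', expand=False).astype(int)

print(last_two.tolist())
print(all_digits.tolist())
```

With expand=False, extract returns a Series instead of a one-column dataframe, which is convenient when there is only one capture group.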

(4) Create a 'Years in Office' column that uses the information in 'Assumed office' to calculate how long the senator has served. Make sure that this column is of type int.

We first need to know which year we are currently in. We will use this to calculate the years in office. We can use the datetime library and its now() function. Run the cell below.

In [44]:
#Keep cell
import datetime
year_ = datetime.datetime.now().year
year_

2021

The years a senator has been in office can be calculated by subtracting from year_ the year when the office was assumed. We can use our regular expression knowledge and simply look for four consecutive digits (\d\d\d\d) and return those. Type in `senators['Years in Office'] = year_ - senators['Assumed office'].str.extract(r'.*,.*(\d\d\d\d).*').astype(int)`. Observe that we use astype(int) to turn the extracted string into an int.

In [45]:
senators['Years in Office'] = year_ - senators['Assumed office'].str.extract(r'.*,.*(\d\d\d\d).*').astype(int)
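
The same calculation can be tried out on a handful of values first. This is only a sketch: the three dates are copied from the table above, year_ is fixed to 2021 so that the result is reproducible, and the shorter pattern `(\d{4})` is my own simplification that captures the first run of four digits.

```python
import pandas as pd

# Three values copied from the 'Assumed office' column, one with a footnote marker
assumed = pd.Series(['January 3, 1987', 'December 20, 2002[e]', 'January 3, 2021'])

# Fixed here for reproducibility; the notebook uses datetime.datetime.now().year
year_ = 2021

# Capture the four-digit year and convert the extracted string to an integer
start_year = assumed.str.extract(r'(\d{4})', expand=False).astype(int)

years_in_office = year_ - start_year
print(years_in_office.tolist())
```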

Finally, let's delete the columns that are no longer needed after these changes, namely 'Born' and 'Party.1'. Run the cell below.

In [46]:
#Keep cell
senators.drop(columns=['Party.1', 'Born'], inplace=True)
senators.head()


Unnamed: 0,State,Portrait,Senator,Party,Occupation(s),Previous electiveoffice(s),Education,Assumed office,Term up,Residence,Age,Years in Office
0,Alabama,,Richard Shelby,Republican[d],Lawyer,U.S. HouseAlabama Senate,"University of Alabama (BA, LLB) Birmingham Sch...","January 3, 1987",2022,Tuscaloosa[2],87,34
1,Alabama,,Tommy Tuberville,Republican,"College football coachPartner, investment mana...",,Southern Arkansas University (BS),"January 3, 2021",2026,Auburn,67,0
2,Alaska,,Lisa Murkowski,Republican,Lawyer,Alaska House of Representatives,Georgetown University (AB) Willamette Universi...,"December 20, 2002[e]",2022,Girdwood[3],64,19
3,Alaska,,Dan Sullivan,Republican,U.S. Marine Corps officerLawyerAssistant Secre...,Alaska Attorney General,Harvard University (AB) Georgetown University ...,"January 3, 2015",2026,Anchorage[4],57,6
4,Arizona,,Kyrsten Sinema,Democratic,Social workerPolitical activistLawyerCollege l...,U.S. HouseArizona SenateArizona House of Repre...,Brigham Young University (BA) Arizona State Un...,"January 3, 2019",2024,Phoenix[5],45,2


Just like before, we would now like to ask questions of the dataset and explore it. 

In particular, we would like to understand the pressure on parties during the next election for the US Senate. At the time of writing, these were the 2022 elections for Congress. We can now reuse some of the strategies for exploring data in Pandas that we learned about earlier.

Let's look into the question of when the senators' seats are up again. Create a new dataframe with copy() from senators that only contains the 'Senator', 'State', 'Party', 'Occupation(s)', 'Years in Office' and 'Term up' columns. Call it senators_seatup.

Run `senators_seatup = senators[['Senator', 'State', 'Party', 'Occupation(s)', 'Years in Office', 'Term up']].copy()`.

In [47]:
senators_seatup = senators[['Senator', 'State', 'Party', 'Occupation(s)', 'Years in Office', 'Term up']].copy()       

Take a look at the first couple of rows of the data, and you will only find those columns you selected.

In [48]:
senators_seatup.head()

Unnamed: 0,Senator,State,Party,Occupation(s),Years in Office,Term up
0,Richard Shelby,Alabama,Republican[d],Lawyer,34,2022
1,Tommy Tuberville,Alabama,Republican,"College football coachPartner, investment mana...",0,2026
2,Lisa Murkowski,Alaska,Republican,Lawyer,19,2022
3,Dan Sullivan,Alaska,Republican,U.S. Marine Corps officerLawyerAssistant Secre...,6,2026
4,Kyrsten Sinema,Arizona,Democratic,Social workerPolitical activistLawyerCollege l...,2,2024


What are the types? Do you need to change them?

In [49]:
senators_seatup.dtypes

Senator            object
State              object
Party              object
Occupation(s)      object
Years in Office     int64
Term up             int64
dtype: object

In my case, they were ok. They are only strings and ints - all in the right place.

Next, we would like to determine when the next election is held. This information is in the 'Term up' column; logically, it is the smallest value there. Assign that value to a variable next_election and print it out by running 
```
next_election = senators_seatup['Term up'].min()
next_election
```

In [50]:
next_election = senators_seatup['Term up'].min()
next_election

2022

Now, we select the rows/observations that are relevant for the next election by filtering the senators_seatup rows with next_election. Assign the results to senators_seatup_next. Do you remember how to do this? If not, check https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html for a quick reference.

In [51]:
senators_seatup_next = senators_seatup[senators_seatup['Term up'] == next_election]
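
If the filter expression looks opaque, it helps to split it into two steps. The sketch below uses five rows copied from the table above; the names toy, mask and up_next are only for illustration.

```python
import pandas as pd

# Five rows copied from the senators_seatup output above
toy = pd.DataFrame({
    'Senator': ['Richard Shelby', 'Tommy Tuberville', 'Lisa Murkowski',
                'Dan Sullivan', 'Kyrsten Sinema'],
    'Term up': [2022, 2026, 2022, 2026, 2024],
})

next_election = toy['Term up'].min()

# Step 1: the comparison yields one True/False value per row
mask = toy['Term up'] == next_election

# Step 2: indexing with the mask keeps only the rows marked True
up_next = toy[mask]

print(mask.tolist())
print(up_next)
```

Note that the filtered dataframe keeps the original row labels (0 and 2 here), which is why the index in the output of the next cell has gaps.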

Display all the senators whose seats are up.

In [52]:
senators_seatup_next

Unnamed: 0,Senator,State,Party,Occupation(s),Years in Office,Term up
0,Richard Shelby,Alabama,Republican[d],Lawyer,34,2022
2,Lisa Murkowski,Alaska,Republican,Lawyer,19,2022
5,Mark Kelly,Arizona,Democratic,"U.S. Navy officerNASA AstronautFounder, Americ...",1,2022
6,John Boozman,Arkansas,Republican,Optometrist,10,2022
9,Alex Padilla,California,Democratic,Engineer,0,2022
10,Michael Bennet,Colorado,Democratic,LawyerInvestment company executiveDenver Publi...,12,2022
12,Richard Blumenthal,Connecticut,Democratic,Marine Corps Reserve SergeantSenate stafferLaw...,10,2022
16,Marco Rubio,Florida,Republican,Lawyer,10,2022
19,Raphael Warnock,Georgia,Democratic,Pastor,0,2022
20,Brian Schatz,Hawaii,Democratic,TeacherNonprofit organization executive,9,2022


So far so good. Let's next group observations together to gain composite insights. Let's look at the senators per US state. Use senators_seatup_next and the columns 'State' and 'Term up' to display the number of terms that are up in the next election. Run `senators_seatup_next[['State', 'Term up']].groupby(['State'], as_index=False).count()`.

In [53]:
senators_seatup_next[['State', 'Term up']].groupby(['State'], as_index=False).count()

Unnamed: 0,State,Term up
0,Alabama,1
1,Alaska,1
2,Arizona,1
3,Arkansas,1
4,California,1
5,Colorado,1
6,Connecticut,1
7,Florida,1
8,Georgia,1
9,Hawaii,1


At least in 2021, there were quite a few senators up for re-election. How does it look for you? 

Finally, we want to look into the election challenges per party. Select 'Party' and 'Term up' and group by party to display the results with count(), please. You can do it ...

In [54]:
senators_seatup_next[['Party', 'Term up']].groupby(['Party'], as_index=False).count()

Unnamed: 0,Party,Term up
0,Democratic,14
1,Republican,19
2,Republican[d],1
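
The separate Republican[d] row appears because one entry in the 'Party' column still carries a Wikipedia footnote marker. One possible way to merge it with the other Republicans is to strip anything in square brackets before counting. This is a sketch on a hand-typed series and not part of the original exercise; the pattern assumes that footnote markers always have the form [x].

```python
import pandas as pd

# Hand-typed party labels, one with a footnote marker as in the scraped table
party = pd.Series(['Republican[d]', 'Republican', 'Democratic',
                   'Republican', 'Democratic'])

# Remove any bracketed footnote marker such as [d] and surrounding whitespace
party_clean = party.str.replace(r'\[.*?\]', '', regex=True).str.strip()

counts = party_clean.value_counts()
print(counts)
```

Applied to the real data, the same replace call on senators['Party'] would fold the single Republican[d] seat into the Republican count.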


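Republicans had far more seats to defend in the 2022 election: 20 against 14 for the Democrats, once you add the separate Republican[d] row, which is Richard Shelby's entry with a Wikipedia footnote marker still attached. You might see different results depending on the election you look at. Let's try and find out a little bit more about the senators up for re-election.  What is their median time in office if you group them by party? 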

The agg function is the last very powerful Pandas tool we introduce: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html. Used together with groupby on party, it lets us easily apply several functions at once, in our case median and count: `senators_seatup_next[['Party', 'Years in Office']].groupby(['Party']).agg(['median', 'count'])`. You should read this expression as a pipeline, where you first select two columns from senators_seatup_next, then group by one of the columns and aggregate the values with two functions.

In [55]:
senators_seatup_next[['Party', 'Years in Office']].groupby(['Party']).agg(['median', 'count'])

Unnamed: 0_level_0,Years in Office,Years in Office
Unnamed: 0_level_1,median,count
Party,Unnamed: 1_level_2,Unnamed: 2_level_2
Democratic,6.5,14
Republican,10.0,19
Republican[d],34.0,1
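
To check your understanding of agg, you can reproduce the idea on the ten senators shown in the table further above. In this sketch the footnote marker on Shelby's party label has been removed by hand, and the keyword form of agg (named aggregation) is used, which gives flat column names instead of the two-level header above. Because it only covers ten rows, the numbers differ from the full result.

```python
import pandas as pd

# Party and years in office of the ten senators displayed earlier
# ('Republican[d]' typed as 'Republican' for this illustration)
toy = pd.DataFrame({
    'Party': ['Republican', 'Republican', 'Democratic', 'Republican', 'Democratic',
              'Democratic', 'Democratic', 'Republican', 'Democratic', 'Democratic'],
    'Years in Office': [34, 19, 1, 10, 0, 12, 10, 10, 0, 9],
})

# Group by party, then compute two statistics for the selected column
summary = toy.groupby('Party')['Years in Office'].agg(median='median', count='count')
print(summary)
```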


In 2021, the Democratic senators up for re-election had served for a much shorter time (a median of 6.5 years against 10 for the Republicans), which might indicate that they had had less time to secure the seat for their party. Your results will of course depend on the year you are looking at, but can you find similar patterns? 

That's it for today's analysis of social communities with the additional bonus of learning a little bit about how to harvest data from the web, which is already advanced stuff. Thank you ...