# Lecture 2 - Introduction to Social Media Analytics with Python

In this notebook we will learn the basics of analyzing social media data with Python.  We will study tweets collected by keyword, tweets collected by user, and user profiles.  Some of the skills you will learn include searching and sorting dataframes and making bar and scatter plots.  For more details on the dataframe functions used in this notebook, you can look here: https://pandas.pydata.org/docs/index.html

This notebook can be opened in Colab 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zlisto/social_media_analytics/blob/main/Lecture02_BasicSocialMediaDataAnalysis.ipynb)

Before starting, select "Runtime->Factory reset runtime" to start with your directories and environment in the base state.

If you want to save changes to the notebook, select "File->Save a copy in Drive" from the top menu in Colab.  This will save the notebook in your Google Drive.



# Clone GitHub Repository
The cell below clones the repository, which includes the code and data files, to your machine.  It then changes into the directory of the repository.

In [None]:
!git clone https://github.com/zlisto/social_media_analytics

import os
os.chdir("social_media_analytics")

## Install Requirements 



In [None]:
!pip install -r requirements.txt

## Import packages

We import the packages we are going to use.  A package contains several useful functions that make our life easier.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from scripts.api import *


#this option makes it so tweets display nicely in a dataframe
pd.set_option("display.max_colwidth", None)



# Keyword Tweets

We begin with a set of tweets that contain a specific hashtag.  These were found using Twitter's Search API.  The tweets were saved in a table in a database file.  The filename of the database is stored in the variable `fname_db`.  The table's name is `"keyword_tweets"`.  Each row of this table is a tweet with many columns of information.  The most important columns are `created_at`, `screen_name`, and `text`.





### Load Keyword Tweets
We load the tweets from the database using the `DB.fetch` function. The tweets are loaded into a variable called **df** which is a *dataframe*.  Dataframes store each tweet as a row and let us access the rows and columns easily.  We will use dataframes a lot.

In [None]:
#filename of database
fname_db = "data/lecture_02"  #database filename

df = DB.fetch(table_name='keyword_tweets', path=fname_db)
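
The `DB.fetch` function comes from the course repository, and its internals are not shown in this notebook.  As a rough picture of what loading a table into a dataframe involves, here is a minimal sketch that assumes the data lives in a SQLite database (this is an assumption, as is the made-up in-memory table used here).  It uses only the standard `sqlite3` module and pandas.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the course file (contents are made up)
con = sqlite3.connect(":memory:")
pd.DataFrame({
    "created_at": ["2021-01-01 10:00:00", "2021-01-02 11:30:00"],
    "screen_name": ["alice", "bob"],
    "text": ["first tweet", "second tweet"],
}).to_sql("keyword_tweets", con, index=False)

# Read the whole table into a dataframe, one row per tweet
df_demo = pd.read_sql_query("SELECT * FROM keyword_tweets", con)
con.close()

print(df_demo.shape)  # (2, 3)
```

Whatever the storage format actually is, the result is the same kind of object: a dataframe with one row per tweet and one column per field.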


### Look at the tweets using head() function
After we load the tweets in **df**, we look at the first few tweets using the *head* function.  We can specify how many rows to show using the *n* parameter.



In [None]:
df.head(n=2)

### Select Columns of Dataframe

Sometimes we just want to look at a few columns of the dataframe.  We can do this by putting the names of the columns we want into a *list*.  In Python, lists have the format `[item_1,item_2,...,item_n]`.  

In [None]:
col = ['screen_name','text']
df[ col].head(n=12)
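
To see the difference between selecting one column and selecting a list of columns, here is a small sketch on a toy dataframe (the rows are made up for illustration).  A single column name gives back a *Series*, while a list of names gives back a smaller dataframe.

```python
import pandas as pd

# Toy data standing in for the tweet table (values are made up)
toy = pd.DataFrame({
    "screen_name": ["alice", "bob", "carol"],
    "text": ["hello world", "good morning", "data is fun"],
    "retweet_count": [3, 0, 12],
})

one_col = toy["text"]                     # single name -> Series
two_cols = toy[["screen_name", "text"]]   # list of names -> DataFrame

print(type(one_col).__name__)    # Series
print(type(two_cols).__name__)   # DataFrame
print(list(two_cols.columns))    # ['screen_name', 'text']
```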

### Sample Rows of Dataframe

The `head` function will give the first few rows of the dataframe.  We can use the `sample` function to randomly sample a fixed number of rows.

In [None]:
df[ ['screen_name','text']].sample(n=2)
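
Because `sample` is random, it returns different rows each time you run it.  If you want the same sample every time, you can pass a seed with the `random_state` parameter.  The sketch below shows this on a toy dataframe (the data and the seed value 42 are arbitrary choices for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": [f"user_{i}" for i in range(10)],
    "retweet_count": range(10),
})

# Same seed -> same rows every time the cell is run
s1 = toy.sample(n=3, random_state=42)
s2 = toy.sample(n=3, random_state=42)
print(s1.equals(s2))  # True

# frac samples a fraction of the rows instead of a fixed number
half = toy.sample(frac=0.5, random_state=42)
print(len(half))  # 5
```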

### Search for Tweets Containing Keywords

We can search the dataframe for tweets that contain a specific keyword.  We do this with the `str.contains` function.  This function takes the keyword as input in the form of a string (this means you put the word inside quotes).  It also has a parameter `case`, which is `True` if you want to match the case of the keyword and `False` if you want to ignore it.



In [None]:
keyword = 'eminem'

df[df.text.str.contains(keyword, case = False)][['screen_name','text']].sample(n=5)
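
Here is a small sketch of how `case` changes the result, using a toy dataframe with made-up tweets.  It also uses the `na` parameter, which sets the result for rows where the text is missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["Eminem drops a new album", "I love eminem", "Nothing to see here", None],
})

# case=False ignores capitalization; na=False treats missing text as "no match"
mask = toy["text"].str.contains("eminem", case=False, na=False)
print(mask.tolist())     # [True, True, False, False]

# case=True (the default) only matches the exact capitalization
mask_cs = toy["text"].str.contains("eminem", case=True, na=False)
print(mask_cs.tolist())  # [False, True, False, False]
```

By default the keyword is treated as a regular expression.  If your keyword has special characters such as `$` or `.`, you can pass `regex=False` to match it literally.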


### Add Column to Dataframe

We can add a column to the dataframe to make data analysis easier.  Let's add a column called `"has_keyword"` which is `True` if the tweet has the word "eminem".  This can be done by doing `df["has_keyword"] = column you want to add`.  In our case, the column we want to add is given by `df.text.str.contains(keyword, case = False)`. 


In [None]:
keyword = 'eminem'

df['has_keyword'] = df.text.str.contains(keyword, case = False)
df.head()
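
The same pattern works for any column you can compute from existing ones.  The sketch below adds two columns to a toy dataframe; the column names `n_chars` and `is_popular` and the threshold of 10 retweets are hypothetical choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["short", "a much longer tweet"],
    "retweet_count": [2, 50],
})

toy["n_chars"] = toy["text"].str.len()          # number of characters in each tweet
toy["is_popular"] = toy["retweet_count"] >= 10  # True/False column from a comparison

print(toy[["n_chars", "is_popular"]])
```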

### Count Rows in Dataframe

We can use the `len` function to find out how many rows a dataframe has.  Let's find out how many tweets contain our keyword, and then print out the result. We can use the column we just created for this to make the code cleaner.

In [None]:
df_keyword = df[df['has_keyword']==True][['screen_name','text']] 
n_keyword = len(df_keyword)

print(f"There are {n_keyword} tweets that contain the keyword '{keyword}' ")
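
Since `True` counts as 1 and `False` as 0, you can also count matching rows by summing a True/False column.  This sketch on toy data shows that both approaches give the same number, and that `mean` gives the fraction of rows that match.

```python
import pandas as pd

toy = pd.DataFrame({"has_keyword": [True, False, True, True]})

n_filter = len(toy[toy["has_keyword"]])   # filter, then count rows
n_sum = int(toy["has_keyword"].sum())     # True counts as 1
frac = toy["has_keyword"].mean()          # fraction of rows that are True

print(n_filter, n_sum, frac)  # 3 3 0.75
```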

### Sort Rows By Column Values
We can sort a dataframe's rows by the values in a column with the `sort_values` function.  It takes as input a column name or a list of column names in the `by` parameter, and an optional parameter `ascending` which can be `True` or `False`. 

Let's sort the tweets in order of decreasing `retweet_count`.

In [None]:
df[['retweet_count','screen_name','text']].sort_values(by = ['retweet_count'], ascending = False).head(n=5)
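
When you sort by more than one column, the later columns break ties in the earlier ones, and `ascending` can be a list with one value per column.  A small sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": ["bob", "alice", "carol", "alice"],
    "retweet_count": [5, 5, 1, 9],
})

# Highest retweet_count first; ties broken alphabetically by screen_name
ranked = toy.sort_values(by=["retweet_count", "screen_name"],
                         ascending=[False, True])
print(ranked["screen_name"].tolist())    # ['alice', 'alice', 'bob', 'carol']
print(ranked["retweet_count"].tolist())  # [9, 5, 5, 1]
```

Note that `sort_values` returns a new sorted dataframe and leaves the original one unchanged.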

### Statistics of Columns

There are built-in functions in a dataframe to calculate many different statistics, such as `mean`, `median`, `var`, `std`, and `quantile`.  For `quantile` we need to pass the quantile we want, which we store in the variable `q`.

In [None]:
mean = df['retweet_count'].mean()
med = df['retweet_count'].median()
std = df['retweet_count'].std()
q = 0.9
quant = df['retweet_count'].quantile(q)

print(f"Retweet count\n\tmean = {mean:.2f}\n\tmedian = {med:.2f}\n\tst. dev. = {std:.2f}\n\t{q:.2f} quantile = {quant:.2f}\n")
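
Retweet counts often have a few very large values, so it helps to look at more than one statistic.  The sketch below uses a made-up series with one large value to show how the mean and the median can differ, and that the 0.5 quantile is the same as the median.

```python
import pandas as pd

counts = pd.Series([0, 1, 2, 3, 100])   # made-up retweet counts with one outlier

print(counts.mean())          # 21.2
print(counts.median())        # 2.0
print(counts.quantile(0.5))   # 2.0, the 0.5 quantile is the median

# The variance is the square of the standard deviation
print(abs(counts.var() - counts.std() ** 2) < 1e-9)  # True
```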

# User Profiles

We next look at a table of user profiles.  These are in the same database in the table `users`.

### Load User Profiles
We load the user profiles from the database using the `DB.fetch` function. The profiles are loaded into a dataframe called **df_u**.

In [None]:
#filename of database
fname_db = "data/lecture_02"  #database filename

df_u = DB.fetch(table_name='users', path=fname_db)


print(f"We have {len(df_u)} user profiles")
df_u.head()

### Bar Graph of Follower Count

We can make a bar graph of the follower count of the users.  To make the plot, we use the `barplot` function in the *seaborn* package.  Details on the seaborn package can be found here: https://seaborn.pydata.org/#

To use `barplot`, we need to input the `data`, which is the dataframe, `x`, which is the name of the column for the x-axis, and `y`, which is the name of the column for the y-axis.  There are many other functions that let us edit the plot to make it look nice.  These are from the *matplotlib* package.  One parameter is the `color` parameter.  A complete list of colors is found here: https://matplotlib.org/stable/gallery/color/named_colors.html

In [None]:
fig = plt.figure(figsize = (8,6))
sns.barplot(data = df_u, x = 'screen_name', y = 'followers_count',
           color = 'crimson')
plt.xlabel('Screen name', fontsize = 16)
plt.ylabel('Followers count', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.title("Twitter Users", fontsize = 20)
plt.grid()
plt.yscale("log")
plt.show()

# User Tweets

The tweets here were collected from the Twitter timelines of a set of users.  

### Load User Tweets

The tweets are in the same database in a table called `"user_tweets"`.  We can load them with the `DB.fetch` function into a dataframe called `df_ut` (ut for user tweets).

In [None]:
#filename of database
fname_db = 'data/lecture_02'

df_ut = DB.fetch(table_name = 'user_tweets', path = fname_db)

print(f"We have {len(df_ut)} user tweets")
df_ut.sample(n=5)

### Group Tweets

We can group the tweets using the `groupby` function.  Once we group the tweets, we can apply other functions to the tweets in each group, such as `mean`.  We do this for the `retweet_count` column.

In [None]:
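#select retweet_count before taking the mean, since text columns cannot be averaged
df_ut.groupby('screen_name')[['retweet_count']].mean()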
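
If you want several statistics for each group at once, you can use the `agg` function with a list of function names.  Here is a small sketch on a toy dataframe (the values are made up).

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": ["alice", "alice", "bob", "bob", "bob"],
    "retweet_count": [2, 4, 10, 20, 30],
})

# One row per screen_name, one column per statistic
summary = toy.groupby("screen_name")["retweet_count"].agg(["count", "mean", "max"])
print(summary)
```

In this toy example, `alice` has 2 tweets with a mean of 3.0 retweets, and `bob` has 3 tweets with a mean of 20.0.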



### Barplot Retweet Count of Groups

We can make a barplot of a column value on the y-axis, and the group on the x-axis.  Seaborn knows to group together tweets in the same group, and plot the mean value along with error bars. In this case, we will plot `retweet_count` on the y-axis, and the groups are the `screen_name` column.

In [None]:
fig = plt.figure(figsize = (8,6))
sns.barplot(data = df_ut, x = 'screen_name', y = 'retweet_count', color = "blue")
plt.xlabel('Screen name', fontsize = 16)
plt.ylabel('Retweet count', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.title("Twitter Users", fontsize = 20)
plt.grid()
plt.yscale("log")
plt.show()


### Subplots

We can plot two figures side by side using the `subplot` function.  You need to specify the number of rows and columns in your subplot grid, and specify which grid box the plot goes in.  The call has the form `subplot(rows, columns, box_number)`.


In [None]:
fig = plt.figure(figsize = (16,6))

plt.subplot(1,2,1)
sns.barplot(data = df_u, x = 'screen_name', y = 'followers_count',
           color = 'crimson')
plt.xlabel('Screen name', fontsize = 16)
plt.ylabel('Followers count', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.title("Twitter Users", fontsize = 20)
plt.grid()
plt.yscale("log")

plt.subplot(1,2,2)
sns.barplot(data = df_ut, x = 'screen_name', y = 'retweet_count', color = "blue")
plt.xlabel('Screen name', fontsize = 16)
plt.ylabel('Retweet count', fontsize = 16)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.title("Twitter Users", fontsize = 20)
plt.grid()
plt.yscale("log")



plt.show()
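
The box numbers in `subplot` start at 1 and go left to right, then top to bottom.  The sketch below makes an empty 2 by 2 grid and checks where box 3 lands.  The first two lines select a non-interactive backend so the sketch can run without a display; you can leave them out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))
axes = []
for box_number in range(1, 5):           # boxes 1..4 in a 2x2 grid
    ax = plt.subplot(2, 2, box_number)
    ax.set_title(f"box {box_number}")
    axes.append(ax)

# Box 3 sits in the second row, first column (rows and columns counted from 0 here)
spec = axes[2].get_subplotspec()
print(spec.rowspan.start, spec.colspan.start)  # 1 0
```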

### Compare Retweet Count of Tweets Containing Different Keywords

We can compare the retweet count of tweets that contain a keyword versus those that do not.  We do this by adding a column to the dataframe called `has_keyword` that is `True` if the tweet has the word.  We can then plot the retweet count grouped by screen name, and separate within the group those where `has_keyword` is `True` and `False`.  We use the `hue` parameter for this in-group separation.

In [None]:
keyword = 'drops'

df_ut['has_keyword'] = df_ut.text.str.contains(keyword, case = False)

fig = plt.figure(figsize = (8,6))
sns.barplot(data = df_ut, x = 'screen_name', y = 'retweet_count',
                hue = 'has_keyword')
plt.xlabel('Screen name',fontsize  = 14)
plt.ylabel('Retweet count',fontsize  = 14)
plt.title(f"Keyword = {keyword}",fontsize = 18)
plt.yscale('log')
plt.grid()
plt.show()

### Describe Groups

We can group the tweets by `screen_name` and `has_keyword` using the `groupby` function.  Then we can summarize the statistics of the groups in a dataframe by using the `describe` function.

In [None]:
print(f"Keyword is {keyword}")
df_ut.groupby(['screen_name','has_keyword'])[['retweet_count']].describe()
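
The output of `describe` on groups has two levels of column names, which can be awkward to work with.  One way to flatten it into an ordinary dataframe is sketched below on toy data (the values are made up).

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": ["alice", "alice", "alice", "bob", "bob"],
    "has_keyword": [True, True, False, False, False],
    "retweet_count": [10, 20, 1, 5, 7],
})

stats = toy.groupby(["screen_name", "has_keyword"])[["retweet_count"]].describe()

# Drop the outer column level and turn the group keys back into ordinary columns
flat = stats["retweet_count"].reset_index()
print(flat[["screen_name", "has_keyword", "count", "mean"]])
```

Each row of `flat` is one (screen name, has keyword) group, so you can search and sort it like any other dataframe.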

# Save Notebook to HTML

This last cell saves the notebook and all of its outputs to an HTML file.  Replace `<notebook_path>` with the path to your notebook file.  You can download the HTML file to your computer from Colab and then print it to a PDF using Ctrl+P.

In [None]:
!jupyter nbconvert --to html <notebook_path>