# Text Analysis Workshop

We'll be working with two datasets in this workshop. One, `plot_summaries`, contains film summaries extracted from Wikipedia articles while the other dataset, `metadata`, contains further information about each movie such as Title, Release Year and Genres.


We'll go through manipulating and plotting this data with Python first. Then we'll cover some basic text analysis of the summaries. Finally, we'll train a classifier that, given a film summary, can predict its genre.

For an introduction to how this web tool (Google Colab Notebook) works go [here](https://colab.research.google.com/notebooks/intro.ipynb). Basically it lets us write and run code online without downloading anything to our computers. 

We'll be writing code in Python throughout, but you don't need a lot of Python experience to follow along.

# Run the following cells in order to set up the data and code libraries we will be using in the workshop.

In [None]:
#@title Download data {display-mode: "form"}
#@markdown [README for original data](http://www.cs.cmu.edu/~ark/personas/data/README.txt)
#@markdown Press the run button to the left to download the movie dataset we will be using in this workshop.
# This code will be hidden when the notebook is loaded.

movie_metadata_link = "https://www.dropbox.com/s/75no5lozv4og3g9/movie_metadata.csv?dl=0"
movie_plots_link = "https://www.dropbox.com/s/71rlzwy4xvaq8nh/plot_summaries.csv?dl=0"
!wget -O movie_metadata.csv {movie_metadata_link}
!wget -O plot_summaries.csv {movie_plots_link}

# update library versions
!pip install -U seaborn


## Import Python libraries

In [None]:
import pandas as pd # pandas library for loading and manipulating data
import numpy as np
from ast import literal_eval # function for loading lists correctly
from sklearn.preprocessing import MultiLabelBinarizer # method for turning a list feature into columns

from collections import Counter # object for counting frequency of words
import nltk # nltk library for text analysis functions
from nltk.corpus import stopwords # list of stopwords from nltk
from nltk.stem import WordNetLemmatizer # function for "lemmatizing" words
import string # library of functions for dealing with strings

# plotting libraries
import seaborn
import matplotlib.pyplot as plt
from wordcloud import WordCloud


In [None]:
# datasets from the nltk library 
# stopwords
nltk.download("stopwords")
# lemmatizer 
nltk.download("wordnet")

In [None]:
# some settings for display of data in notebook
pd.options.display.max_rows = 1000
pd.options.display.max_colwidth = 500

# Load Movie Data

We use the [__pandas library__](https://pandas.pydata.org/docs/reference/index.html) to load the dataset into a usable format in Python.

In [None]:
plot_summaries = pd.read_csv("plot_summaries.csv")

In [None]:
metadata = pd.read_csv("movie_metadata.csv",
                       converters={"languages": literal_eval,
                                   "countries": literal_eval,
                                   "genres": literal_eval},
                       # this loads the lists as lists instead of strings
                       encoding="utf-8") 
                       # this loads the text in the correct encoding format

## What does this dataset contain?

In [None]:
# this will show us the column names and data types of the metadata dataframe
metadata.dtypes 

In [None]:
# do the same for the plot_summaries dataframe

In [None]:
# number of rows in the summaries "dataframe"
plot_summaries.shape[0]

In [None]:
# we expect each row to have a unique wikiId
plot_summaries.wikiId.nunique()

In [None]:
# how many rows does the metadata dataframe have?

In [None]:
# does each metadata row have a unique wikiId value?

In [None]:
# let's look at the data for one movie - pick a movie title
metadata[metadata.name==""]

## Exploring the genres feature

In [None]:
# This function will take the genres feature and transform it from a list into its own dataframe
def list_into_df(df, list_col):
  mlb = MultiLabelBinarizer()
  new_cols = pd.DataFrame(mlb.fit_transform(df[list_col]), columns=mlb.classes_, index=df.index)
  return new_cols

In [None]:
# create the "genres belonging to each movie" dataframe
genre_df = list_into_df(metadata, 'genres')
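To see what `list_into_df` produces, here is a minimal sketch on a made-up three-film dataframe. The `toy` dataframe, its film names and its genre lists are hypothetical stand-ins for `metadata`. The sums at the end are just one possible way to approach the questions in the cells that follow.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data standing in for the `metadata` dataframe
toy = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "genres": [["Drama", "Romance Film"], ["Comedy"], ["Comedy", "Drama"]],
})

# Same steps as list_into_df: one 0/1 column per genre, one row per film
mlb = MultiLabelBinarizer()
toy_genres = pd.DataFrame(mlb.fit_transform(toy["genres"]),
                          columns=mlb.classes_, index=toy.index)
print(toy_genres)

# Summing down each column counts the films in each genre
genre_counts = toy_genres.sum().sort_values(ascending=False)

# Summing across each row counts the genres attached to each film
genres_per_movie = toy_genres.sum(axis=1)
print(genre_counts)
print(genres_per_movie.mean())
```

Because every cell is 0 or 1, column sums, row sums and column means answer most of the counting and proportion questions without any loops.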

In [None]:
# which genres are the most common?

In [None]:
# which genres are the least common?

In [None]:
# what proportion of films belong to each genre 

In [None]:
# how many genres does each movie have? 

In [None]:
# what is the average number of genres that each movie has?

In [None]:
# plot the number of genres per movie in a `barplot`

### Let's focus on one genre

In [None]:
# create a feature that is True if a row is a 'Romance Film' and False if it is not.

In [None]:
# what proportion of films are classified as 'Romance Film'?

In [None]:
# how has the proportion of films that are 'Romance Film' genre changed over time?

In [None]:
# plot the proportion of 'Romance Film' over time

# Text Analysis
Let's move on to analysing the plot summary data directly.


You may have noticed already that there are more rows in the metadata dataset than there are in the plot_summaries dataset. For this portion we only care about the movies that are in both datasets. So let's start by joining these two datasets together and only keeping movies that are in both. 
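As a rough sketch of that join, the example below uses two tiny made-up dataframes. Only the `wikiId` column name comes from the real data. The `name` and `summary` columns and all the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets
toy_metadata = pd.DataFrame({
    "wikiId": [1, 2, 3, 4],
    "name": ["Film A", "Film B", "Film C", "Film D"],
})
toy_summaries = pd.DataFrame({
    "wikiId": [2, 3, 5],
    "summary": ["Summary of B", "Summary of C", "Summary of an unknown film"],
})

# How many summary ids also appear in the metadata?
n_shared = toy_summaries["wikiId"].isin(toy_metadata["wikiId"]).sum()
print(n_shared)

# An inner merge keeps only the rows whose wikiId is present in both dataframes
merged = toy_metadata.merge(toy_summaries, on="wikiId", how="inner")
print(merged)
```

With `how="inner"`, films that appear in only one of the two dataframes are dropped, which is the behaviour we want here.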

In [None]:
# how many plot_summaries wikiId values are also in the metadata dataframe?
plot_summaries.wikiId.isin(metadata.wikiId).sum()

In [None]:
# let's merge the dataframes using the wikiId to join them.

## What are the most frequently occurring words in all the plot_summaries?

In [None]:
# join all the plot summaries into one string

In [None]:
# turn the long string into a list of words

In [None]:
# count the occurrence of each word in the list
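The three steps above can be sketched on a few made-up summaries. The `toy_summaries` list is hypothetical, and lower-casing before splitting is an assumption so that "The" and "the" are counted together.

```python
from collections import Counter

# Hypothetical stand-ins for the real plot summaries
toy_summaries = [
    "A detective falls in love with a witness",
    "Two friends fall in love and travel the world",
    "The detective and the witness travel together",
]

# 1. join all the summaries into one long string
all_text = " ".join(toy_summaries)

# 2. lower-case the string and split it on whitespace to get a list of words
words = all_text.lower().split()

# 3. count how often each word occurs
word_counts = Counter(words)

# the most frequent words and their counts
print(word_counts.most_common(3))
```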

## Removing 'uninteresting' words.
Some of these words occur very frequently but tell us little about the content of a summary. For example, "and" is very common in almost any text.
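A minimal sketch of this filtering step, using a tiny hand-written stopword set. The `toy_stopwords` set and the example sentence are made up for illustration. In the notebook, a fuller list could come from `stopwords.words("english")`, which we imported from nltk earlier.

```python
from collections import Counter

# A made-up list of words, as if produced by splitting the summaries
words = "the detective and the witness fall in love and travel the world".split()

# A tiny hand-written stopword set, for illustration only
toy_stopwords = {"the", "and", "in", "a", "of"}

# keep only the words that are not in the stopword set
interesting_words = [w for w in words if w not in toy_stopwords]

print(Counter(interesting_words).most_common(3))
```

Using a `set` rather than a list for the stopwords keeps the membership check fast when the word list is long.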


In [None]:
# create a list of uninteresting words and filter our original list of words.

## Clean messy data
This approach of splitting a long string into words leaves us with some "words" that don't exactly match what we're looking for.

In [None]:
# this creates a table of replacements for any punctuation
punc_table = str.maketrans({p: "" for p in string.punctuation})
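To see what the replacement table does, here is a small sketch applied to a few made-up "words". The `toy_words` list is hypothetical. `str.translate` uses the table to delete every punctuation character.

```python
import string

# the same table as above: every punctuation character maps to an empty string
punc_table = str.maketrans({p: "" for p in string.punctuation})

# Hypothetical messy tokens, as might appear after splitting on whitespace
toy_words = ["love,", "war.", "--", "(1999)", "it's"]

# strip the punctuation from every token
no_punc = [w.translate(punc_table) for w in toy_words]
print(no_punc)

# tokens that were only punctuation are now empty strings, so drop them
clean_words = [w for w in no_punc if w]
print(clean_words)
```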

In [None]:
# remove all the punctuation from the list of words

In [None]:
# remove all empty strings from the list of words

## Tidying verbs and other forms of words.
Some words, especially verbs such as "talk", can appear in multiple forms in a text, e.g. talk, talking, talked, ...


However, their meaning is roughly the same for our purposes. We can try to turn all versions of these words into one form so that our count of the most common words is more accurate.

In [None]:
wnl = nltk.WordNetLemmatizer()

In [None]:
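# lemmatize() treats a word as a noun unless told otherwise, so pass pos="v" to reduce a verb to its base form
wnl.lemmatize("talking", pos="v")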

In [None]:
# let's lemmatize all of the words in our list of tokens.

## Plotting word frequency

In [None]:
# plot the top 10 most frequent words 

In [None]:
# generate a 'wordcloud'
DICT_OF_WORD_FREQUENCIES = ...  # replace ... with a dict (or Counter) mapping each word to its count
wordcloud = WordCloud(max_font_size=60,
                      background_color='white').generate_from_frequencies(DICT_OF_WORD_FREQUENCIES)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Naive Bayes Classifier

A Naive Bayes classifier is built on Bayes' theorem:
$P(H | E ) = \frac{P(E | H)P(H)}{P(E)}$

This probability on the left-hand side is the probability of our hypothesis given the evidence we have. In our case our hypothesis is that this film is a 'Romance Film'. Our evidence is the plot summary of the film. 

We can compute this probability by using the expression on the right-hand side. Let's go through what each of the terms means:

- Hypothesis = Film is a Romance film 
- Evidence = The words used in the film summary
- P(Evidence | Hypothesis) = The probability of these words being used given the film is a Romance Film.  
- P(Hypothesis) = $\frac{\text{Number of romance films}}{\text{Total number of films}}$
- P(Evidence) = The probability of these words being used in a film summary, regardless of genre.


We're going to base these probabilities on the data we have. For example, the probability of a word occurring in a summary is going to be based on our count of how frequently that word occurs in the dataset summaries.
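As a worked sketch of these terms, the example below estimates P(Romance | "love") from five made-up summaries. The `toy_train` data is hypothetical, and estimating P(word | class) as the word's share of all words in that class is one simple choice among several.

```python
from collections import Counter

# Hypothetical labelled summaries: (text, is_romance)
toy_train = [
    ("they fall in love in paris", True),
    ("a love story about two rivals", True),
    ("a detective hunts a killer", False),
    ("soldiers love their country and fight", False),
    ("robots fight in space", False),
]

def word_counts(is_romance):
    """Count every word in the summaries that have the given label."""
    counts = Counter()
    for text, label in toy_train:
        if label == is_romance:
            counts.update(text.split())
    return counts

romance_counts = word_counts(True)
other_counts = word_counts(False)

# P(Hypothesis): the share of films with each label
p_romance = sum(label for _, label in toy_train) / len(toy_train)
p_other = 1 - p_romance

# P(Evidence | Hypothesis): the share of the words in each class that are "love"
p_love_given_romance = romance_counts["love"] / sum(romance_counts.values())
p_love_given_other = other_counts["love"] / sum(other_counts.values())

# numerators of Bayes' theorem for each hypothesis
score_romance = p_love_given_romance * p_romance
score_other = p_love_given_other * p_other

# P(Evidence) is the sum of the numerators over both hypotheses
p_love = score_romance + score_other
p_romance_given_love = score_romance / p_love
print(p_romance_given_love)
```

Since P(Evidence) is the same for both hypotheses, comparing `score_romance` with `score_other` is enough to pick the more likely class.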

In [None]:
# Let's split our data into train and test sets

In [None]:
# let's find P(love | film is a 'Romance Film') in the train set

In [None]:
# first let's recreate our method for getting a list of tokens from a string

In [None]:
# apply our list of tokens function to the 'Romance Film's in the train set

In [None]:
# get the count of words in the list

In [None]:
# sum all of the counts for all the words

In [None]:
# get the count for the word love

In [None]:
# divide count_love by count_all_the_words

In [None]:
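# let's estimate P(Film is Romance | "love" in the summary)
# i.e. let's compute the numerator P("love" | Film is Romance) * P(Film is Romance);
# the denominator P("love") is the same for both classes, so we can leave it out when comparing them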

In [None]:
# now get the count of love and count of all the other words in the set of non-Romance Films
# and compute the P(Film is not Romance | "love" in the summary)

In [None]:
# let's put this all together and compute all the counts for all of the words in our dataset

In [None]:
# compute P(Romance Film) and P(Not Romance Film)

In [None]:
# compute for each word in train dataset the counts for 1. Romance Films and 2. Not Romance Films

In [None]:
# define a function that takes a summary and returns the probability of it belonging to a Romance Film and the probability of it belonging to not a Romance Film
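One possible shape for such a function is sketched below on made-up data. The `toy_train` list and the `class_probabilities` name are hypothetical. The sketch treats the words as independent given the class, which is the "naive" assumption. It also adds two things that the text above does not cover: add-one smoothing, so that a word never seen in a class does not zero out the whole product, and sums of logarithms, so that long summaries do not underflow.

```python
import math
from collections import Counter

# Hypothetical labelled training summaries: (text, is_romance)
toy_train = [
    ("they fall in love in paris", True),
    ("a love story about two rivals", True),
    ("a detective hunts a killer", False),
    ("soldiers love their country and fight", False),
    ("robots fight in space", False),
]

# word counts and film counts for each class
counts = {True: Counter(), False: Counter()}
n_films = Counter()
for text, label in toy_train:
    counts[label].update(text.split())
    n_films[label] += 1

vocabulary = set(counts[True]) | set(counts[False])

def class_probabilities(summary):
    """Return (P(Romance | summary), P(Not Romance | summary))."""
    log_scores = {}
    for label in (True, False):
        total_words = sum(counts[label].values())
        # start from log P(class)
        log_score = math.log(n_films[label] / len(toy_train))
        for word in summary.split():
            # add-one smoothing (an assumption of this sketch)
            p_word = (counts[label][word] + 1) / (total_words + len(vocabulary))
            log_score += math.log(p_word)
        log_scores[label] = log_score
    # normalise so that the two probabilities sum to 1
    max_log = max(log_scores.values())
    exp_scores = {label: math.exp(s - max_log) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return exp_scores[True] / total, exp_scores[False] / total

p_rom, p_not = class_probabilities("a love story in paris")
print(p_rom, p_not)
```

In the notebook, the same function would be built from the train set counts and then tried on summaries from the test set.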

In [None]:
# let's test our model out on a movie from the test set

# Further Links and Resources

## Documentation
- [Dataset Documentation](http://www.cs.cmu.edu/~ark/personas/data/README.txt)
- [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)
  - [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)
- [WordCloud Documentation](https://amueller.github.io/word_cloud/)
- [NLTK Documentation](https://www.nltk.org/book/)
  - [API Docs](http://www.nltk.org/api/nltk.html)
  - [WordNet Lemmatizer](http://www.nltk.org/api/nltk.stem.html?highlight=lemmatizer#module-nltk.stem.wordnet)
  - [Stopwords](https://www.nltk.org/book/ch02.html)
- [Seaborn plotting library](https://seaborn.pydata.org/tutorial.html)

## Online Courses

- [Learn Python @ Codecademy (free trial available)](https://www.codecademy.com/learn/learn-python-3)
- [Data Visualisation](https://infovis.fh-potsdam.de/tutorials/)
- [Natural Language Processing (NLP)](https://lena-voita.github.io/nlp_course.html#main_page_content)
- [NLP with the spaCy library](https://course.spacy.io/en/)
- [Machine Learning](https://course.fast.ai/)

## Videos

- [Bayes Theorem](https://www.youtube.com/watch?v=HZGCoVF3YvM)

## Tutorials

- [Pandas getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
- [Scikit-learn tutorials](https://scikit-learn.org/stable/getting_started.html)
- [scikit-learn Naive Bayes explanation](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [Speech and Language Processing chapter on Naive Bayes](https://web.stanford.edu/~jurafsky/slp3/4.pdf) 