# Assignment 3: Prediction/Modeling
Due: Friday, Nov 22, 2019 in class

Submission: Complete this notebook and print out the output or electronically submit it.

Everything you need to complete is marked with a TODO. For textual questions create a new cell under the question to respond to it.

## Dataset
Game of Thrones is one of the most watched TV series of all times.  With hundreds of characters and more than 22K sentences, this dataset aims to help you test your text mining skills. The content is pretty simple: the dataset contains each and every sentence said in the serie together with who has said it, the episode and the season. For the time being the dataset includes episodes from Season 1 to Season 7. You can download the dataset here: https://github.com/sjyk/cmsc21800/blob/master/got.csv

### Loading the Dataset
The first task is to load the dataset into a pandas dataframe and filter relevant rows for this assignment. We only care about the rows for chracters that are present in all 7 of the seasons and speak a sufficient amount. Filter the rows to include only those with the speaker name "Cersei", "Daenerys", "Tyrion", and "Arya"--make sure you handle upper-case and lower case properly!

In [176]:
import pandas as pd

def load_dataset(filename):
    '''TODO: Given a filename return a dataframe 
       containing the rows.
       
       Only return those rows with a name:
       * "cersei"
       * "daenerys"
       * "tyrion"
       * "arya"
    '''
    
    df = pd.read_csv(filename, delimiter=';')
    return df[ (df['Name'] == 'cersei') | \
               (df['Name'] == 'tyrion') | \
               (df['Name'] == 'daenerys') | \
               (df['Name'] == 'arya')]

    #raise ValueError("Not Implemented")

### Basic Cluster Analysis
Next, we will mine this dataset to understand what types of structure exist. In the next task, we will write a featurizer that takes the dataset and converts it into a set of feature vectors. We will use a tf-idf featurizer to do this:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [195]:
from sklearn.feature_extraction.text import TfidfVectorizer

def featurize(quotes):
    '''TODO: takes a set of quotes as input and returns two things: an array of feature vectors 
             and the featurizer. 
             
       * Use the tfidfvectorizer from sklearn and remove english stopwords and restrict the features to 
             words that appear in *at most* 20 quotes.
    
       Return values (returns a tuple!!): 
                      X - a dense numpy array of feature vectors representing the text data.
                      vectorizer - a TfidfVectorizer object.
    '''
    
    
    vectorizer = TfidfVectorizer(stop_words='english', max_df=20)
    X = vectorizer.fit_transform(quotes)
    return X.todense(), vectorizer

    #raise ValueError('Not Implemented')
    

df = load_dataset('got.csv')
X, vectorizer = featurize(df['Sentence'])

Now, let's compute the principal components of this featurized dataset:

In [196]:
from sklearn.decomposition import PCA
import numpy as np

def compute_pca(features, components=2):
    '''TODO: Calculate the first two principal components of the 
             features that you have. Return the components, 
             the explained variance, and N-D representation of the
             feature vectors.
       
       Return Values (returns a 4-tuple): 
              * axes (the principal components from .components_)
              * Y (the dimensionality reduced data)
              * c = explained variance on a range from [0,1]
    '''
    
    pca = PCA(n_components=components) #find 2 principal components
    Y = pca.fit_transform(X)

    c = np.sum(pca.explained_variance_ratio_)
    
    return pca.components_, Y, c
    #raise ValueError("Not Implemented")

#compute PCA
pcs, Y, c = compute_pca(X, 3)

Now write code to interpret the PCA components. Write a function that uses the vectorize to determine the words whose presence or absence is strongest in the PC.

In [198]:
def top_k(pc,vectorizer,k=10):
    '''TODO: Finds the highest (most positive) weighted elements in a pc and 
             then returns the words that correspond to those elements. 
             
             Exclude all words that are less than 3 letters.
             
       Return Value: A set of k words
    '''
    
    weights = list(zip(pc, vectorizer.get_feature_names()))
    weights.sort()
    return set([word for _, word in weights[:k] if len(word) > 3])
    #raise ValueError("Not Implemented")

#extract each of the pcs
pc1, pc2, pc3 = pcs
    
print("PC 1: Most Positive: ", top_k(pc1, vectorizer))
print()
print("PC 2: Most Positive: ", top_k(pc2, vectorizer))
print()
print("PC 3: Most Positive: ", top_k(pc3, vectorizer))
print()
print("Explained Variance: ", c)

PC 1: Most Positive:  {'send', 'isnt', 'choice', 'prince', 'ship', 'podrick', 'game', 'talk', 'shae'}

PC 2: Most Positive:  {'ones', 'care', 'drogo', 'sleep', 'promised', 'fine', 'khal', 'havent', 'mean'}

PC 3: Most Positive:  {'sleep', 'promised', 'jorah', 'prince', 'feel', 'kind', 'oberyn', 'save', 'coming', 'whore'}

Explained Variance:  0.008340856769294209


For those of you who know the story, you can see the story arcs in the principal components.

## Predicting The Speaker
Now, we will have you predict the speaker from the patterns in the text. The first step is to define a training and a test set. Write the following function that splits the loaded dataset into a training set (80% of the data) and a test set (20% of the data). The partition should be random.

In [219]:
def train_test_split(dataframe):
    '''TODO: Write a function that splits the dataset into a 
             training set and a testing set
        
       Return values (returns a tuple!) : - A training set 80% of the data, - A test set 20% of the data.
    '''
    mask = (np.random.rand(len(dataframe)) <= 0.8)
    
    return df[mask], df[~mask]

train, test = train_test_split(df)

### TODO. Your task is to build a classifier that will achieve at least 45% accuracy on this dataset
To achieve this you will have to manipulate the data and play around with different featurization techniques and modeling choices. First, write a function that "fits" a language model, such as TFIDF, to the training dataset. It is up to you to tune the parameters for the vectorizer you choose approriately.

In [302]:
def language_model(training_quotes):
    '''TODO: Write a function that instantiates a vectorizer (e.g., a TfidfVectorizer), runs fit() and 
       returns the vectorizer.
    '''
    
    vectorizer = TfidfVectorizer(stop_words='english',max_df=30)
    vectorizer.fit(train['Sentence'])
    
    return vectorizer    
    
    #raise ValueError("Not implemented")

Next, you will write a featurizer that takes in a set of quotes and returns an array of feature vectors using the language model above. You may add whatever additional features you find useful. 

In [305]:
def prediction_featurize(quotes, vectorizer):
    X = vectorizer.transform(quotes)
    return X.todense()

    #raise ValueError("Not Implemented")

Finally, determine the right machine learning model to use to actually make the prediction.

In [308]:
vectorizer = language_model(train['Sentence'])
X = prediction_featurize(train['Sentence'], vectorizer)
Y = train['Name']

Xtest = prediction_featurize(test['Sentence'], vectorizer)
Ytest = test['Name']

##TODO: YOUR CODE HERE (Bad Example That Doesn't Work Well)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X,Y)
pred = clf.predict(Xtest)
##END:TODO


#calculate accuracy
from sklearn.metrics import classification_report
print(classification_report(Ytest, pred))



              precision    recall  f1-score   support

        arya       0.42      0.26      0.32       144
      cersei       0.44      0.29      0.35       188
    daenerys       0.38      0.26      0.31       162
      tyrion       0.40      0.65      0.50       282

    accuracy                           0.41       776
   macro avg       0.41      0.36      0.37       776
weighted avg       0.41      0.41      0.39       776

