# Modeling

#### What type of model should we use?
We wanted to determine which type of model. Because of the format of our data, we wanted to run the model season-by-season, using encoded vectors for the contestants. We will have to use a neural network to make a decision between the contestants. 

In [1]:
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow import keras
import time
from datetime import datetime, timedelta

# Part 1: Importing and formatting data for the model

### We want the data in this format:

Season #

|tweets # (index) | character 1 mentioned | character 2 mentioned | ... | character n mentioned | sentiment analysis positive | sentiment analysis negative | sentiment analysis neutral | sentiment analysis compound| result|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| int | bin | bin | bin | bin | float | float | float | float | Binary vector with len(n)|

In [2]:
# Pulling in final datasets and deleting index columns
master_data = pd.read_csv("./wikipedia_master.csv").drop(columns=["Unnamed: 0"])
twitter_data = pd.read_csv("./twitter_data.csv").drop(columns=["Unnamed: 0"])

# Changing twitter_data seasons to int from float
twitter_data["Season"] = twitter_data["Season"].values.astype(int)

#### Checking the datasets

In [3]:
twitter_data

Unnamed: 0,Text,Season,Date,Sentiment Analysis
0,BYE MARTIN I HOPE THE DOOR HITS YOU ON THE WAY...,18,11/30/21,"{'neg': 0.0, 'neu': 0.791, 'pos': 0.209, 'comp..."
1,Wow just getting caught up on and man Michell...,18,11/30/21,"{'neg': 0.083, 'neu': 0.564, 'pos': 0.353, 'co..."
2,,18,11/30/21,"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound..."
3,yo fuck Martin toxic ass bitch,18,11/30/21,"{'neg': 0.783, 'neu': 0.217, 'pos': 0.0, 'comp..."
4,Martin walked so Chris S Could run,18,11/30/21,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
...,...,...,...,...
6324,WAIT WAIT MEDICAL MALPRACTICE DEFENSE ATTORNEY...,16,11/16/20,"{'neg': 0.0, 'neu': 0.791, 'pos': 0.209, 'comp..."
6325,Taysha Is Pretty,16,11/16/20,"{'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'comp..."
6326,Kaitlyn Bristowe Shows Off DWTS Injuries Tv Sh...,16,11/16/20,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
6327,Love a good Wells cameo,16,11/16/20,"{'neg': 0.0, 'neu': 0.099, 'pos': 0.901, 'comp..."


In [4]:
master_data

Unnamed: 0,Season,Episode,Title,Date Aired,Contestants,Voted Off
0,9,1,Week 1,5/27/13,['Chris Siegfried' 'Drew Kenney' 'Brooks Fores...,"['Diogo Custodio', 'Larry Burchett', 'Micah He..."
1,9,2,Week 2,6/3/13,['Chris Siegfried' 'Drew Kenney' 'Brooks Fores...,"['Nick Mucci', 'Robert Graham', 'Will Reese']"
2,9,3,Week 3,6/10/13,['Chris Siegfried' 'Drew Kenney' 'Brooks Fores...,"['Brandon Andreen', 'Dan Cox', 'Brian Jarosins..."
3,9,4,Week 4,6/17/13,['Chris Siegfried' 'Drew Kenney' 'Brooks Fores...,"['Zack Kalter', 'Brad McKinzie']"
4,9,5,Week 5,6/24/13,['Chris Siegfried' 'Drew Kenney' 'Brooks Fores...,"['Mikey Tenerelli', 'Ben Scott', 'Bryden Vukas..."
...,...,...,...,...,...,...
95,18,4,Week 4,11/9/21,['Brandon Jones' 'Joe Coleman' 'Nayte Olukoya'...,"['Chris Gallant', 'Romeo Alexander', 'Will Ure..."
96,18,5,Week 5,11/16/21,['Brandon Jones' 'Joe Coleman' 'Nayte Olukoya'...,"['Casey Woods', 'Leroy Arthur', 'Chris Sutton']"
97,18,6,Week 6,11/23/21,['Brandon Jones' 'Joe Coleman' 'Nayte Olukoya'...,"['Olumide ""Olu"" Onajide', 'Rick Leach', 'Marti..."
98,18,7,Week 7,11/30/21,['Brandon Jones' 'Joe Coleman' 'Nayte Olukoya'...,['Rodney Matthews']


#### We need to clean the data I imported from permanent, .csv storage as dtypes were not preserved. 

In [5]:
# Wikipedia Data Parsing

dates = master_data["Date Aired"]
dates_new = []
conts = master_data["Contestants"]
conts_new = []
vo = master_data["Voted Off"]
vo_new = []

# Getting dates back into datetime format
for i in range(len(dates)):
    dates_new.append(datetime.strptime(dates[i], "%m/%d/%y"))
    
# getting the list of contestants as a list
for i in range(len(conts)):
    temp_conts = conts[i][2:-2]
    temp_conts = temp_conts.replace("\n", "")
    temp_conts = temp_conts.split("' '")
    conts_new.append(temp_conts)
    
# getting list of voted off as a list
for i in range(len(vo)):
    temp_vo = vo[i][2:-2]

    temp_vo = temp_vo.replace("\n", "")
    temp_vo = temp_vo.split("', '")
    if temp_vo == ['']:
        temp_vo = ["0"]
    vo_new.append(temp_vo)
 

data = {"Date Aired": dates_new, "Contestants": conts_new, "Voted Off": vo_new}

master_data.drop(columns=["Date Aired", "Contestants", "Voted Off"], inplace=True)

master_data = master_data.join(pd.DataFrame(data=data))

In [6]:
# Twitter Data Parsing

dates = twitter_data["Date"]
dates_new = []
scores = twitter_data["Sentiment Analysis"]
scores_new = []

# Getting dates back into datetime format
for i in range(len(dates)):
    dates_new.append(datetime.strptime(dates[i], "%m/%d/%y"))
    
# getting the sentiment analysis scores as a dictionary
for i in range(len(scores)):
    sc = scores[i][1:-1].split(", ")
    temp_dict = {"neg": None, "neu": None, "pos": None, "compound": None}
    temp_list = []
    for entry in sc:
        score = float(entry.split(": ")[1])
        temp_list.append(score)
    temp_dict["neg"] = temp_list[0]
    temp_dict["neu"] = temp_list[1]
    temp_dict["pos"] = temp_list[2]
    temp_dict["compound"] = temp_list[3]
    scores_new.append(temp_dict)

# Creating a new dictionary with the data extracted in the correct datatype    
data = {"Date": dates_new, "Sentiment Analysis": scores_new}

# Dropping the columns to replace from the current dataframe
twitter_data.drop(columns=["Date", "Sentiment Analysis"], inplace=True)

# Adding the new data in place of the old data
twitter_data = twitter_data.join(pd.DataFrame(data=data))


#### We have to now extract all the data we need which we will put into a dictionary of dataframes with each dataframe having the model-ready data for each season

In [7]:
from collections import defaultdict

# Establishing a useful dictionary to be used for the rest of the code
all_data = {9:None, 10:None, 11:None, 12:None, 13:None,  
        14:None, 15:None, 16:None, 18:None}

tweets_per_season = []

# For each season
for season in all_data.keys():
    
    # Get subsets of the dataframes and establish a season dictionary
    season_data = defaultdict()
    twitter_subset = twitter_data[twitter_data["Season"] == season]
    wikipedia_subset = master_data[master_data["Season"] == season]

    # for each tweet that happened that season
    for i in range(len(twitter_subset)):
        episode_data = defaultdict()
        
        # Get the sentiment analysis scores and add them to the season dictionary
        episode_data["Positive Scores"] = twitter_subset.iat[i, 3]["pos"]
        episode_data["Neutral Scores"] = twitter_subset.iat[i, 3]["neu"]
        episode_data["Negative Scores"] = twitter_subset.iat[i, 3]["neg"]
        episode_data["Compound Scores"] = twitter_subset.iat[i, 3]["compound"]
        
        # extract the date of the tweet
        tweet_date = twitter_subset.iat[i, 2]
        
        # initialize an episode_date variable for use in comparing tweet date with the episode date
        episode_date = None
        
        # If it was live tweeted, then use the date of the tweet as the episode date
        for date in wikipedia_subset["Date Aired"]:
            if tweet_date == date:
                episode_date = tweet_date
                break
            else:
                # if it was not live tweeted, step back a few days until the nearest episode and use that episode
                for d in range(7):
                    td = timedelta(days=d)
                    date_forward = date + td
                    if tweet_date == date_forward:
                        episode_date = date

        # get who was voted off for that tweet, along with some gross formatting fixes           
        voted_off = list(wikipedia_subset[wikipedia_subset["Date Aired"] == episode_date]["Voted Off"])
        if voted_off == []:
            voted_off = [["0"]]
        voted_off = voted_off[0]
        
        # establish the value n which will define the number of extra columns for the encoded-vectors of the contestants
        contestants = wikipedia_subset.iat[0, 4]
        n = len(contestants)
        
        # getting a list of first names of the contestants for crossreferencing with tweet mentions
        contestants_first_names = set([name.split(" ")[0].lower() for name in contestants])
        
        # extracting the text of the tweets as a list of words
        tweet_text_split = str(twitter_subset.iat[i, 0]).split(" ")
        tweet_text_split = set([word.lower() for word in tweet_text_split])
        
        # extracting the mentions of characters from the tweets
        mentions = tweet_text_split.intersection(contestants_first_names)
        
        # creating vectors for mentions with length n (number of characters)
        for name in contestants_first_names:
            if name in mentions:
                episode_data[name] = 1
            else:
                episode_data[name] = 0
                
        # creating the softmax vector for y-data in the model
        vec = []
        for name in contestants:
            if name in voted_off:
                vec.append(1)
            else:
                vec.append(0)
        
        # adding the final vector to the episode_data dictionary
        episode_data["Voted Off"] = vec

        # if this is the first tweet, add the keys of episode_data to the season_data and make them lists
        # if not, append the values from that episode to  season_data lists
        if i == 0:
            for key in episode_data.keys():
                season_data[key] = []
        else:
            for key in episode_data.keys():
                season_data[key].append(episode_data[key])
    
    # Using season_data, create a Dataframe and add it to the all_data dictionary
    all_data[season] = pd.DataFrame(data=season_data)
    
    


### Now we have a dictionary of DataFrames which are formatted correctly for a model

In [8]:
# Checking that the data is in the format we want!
all_data[9].head(5)

Unnamed: 0,Positive Scores,Neutral Scores,Negative Scores,Compound Scores,brad,michael,juan,zak,brooks,kasey,...,jonathan,larry,brian,will,nick,zack,mikey,bryden,mike,Voted Off
0,0.0,1.0,0.0,0.0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,0.0,1.0,0.0,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,0.314,0.686,0.0,0.4939,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,0.425,0.575,0.0,0.5709,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,0.326,0.674,0.0,0.4404,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [9]:
# Saving the data to .csv files
for key in all_data.keys():
    all_data[key].to_csv("./model_data/season_%i_model_data.csv"%key)

# Part 2: Model development

#### We wanted to use a basic 20-neuron, three layer neural network using keras to predict a length-n output vector which can be mapped to contestant names

In [10]:
from keras.models import Sequential
from keras.layers import Dense, Activation

# Creating a function which returns the model of the correct architecture and dimensions
def make_model(n_cols, n_contestants):
    model = Sequential()
    model.add(Dense(20, input_dim=n_cols, activation = 'relu'))
    model.add(Dense(20, activation = 'relu'))
    model.add(Dense(n_contestants, activation='softmax'))
    
    # Optimize with stochastic gradient descent, with the loss function as mean squared error
    model.compile(optimizer='sgd',
                  loss='mse',
                  metrics=['accuracy'])

    return model


In [11]:
from tensorflow.data import Dataset

# season list for iteratively running the model and an accuracy list for permanently storing model accuracies
seasons = [9, 10, 11, 12, 13, 14, 15, 16, 18]
accuracy = []

for season in seasons:
    # Useful variables for use in model creation
    n_cols = len(all_data[season].columns)-1
    n_contestants = len(all_data[season].iloc[0, -1])
    
    # set up datasets for training
    x = tf.cast(all_data[season].iloc[:, :-1], tf.float32)                   # Casting all of the matrix data as a float
    y = np.array([np.array(vec) for vec in all_data[season].iloc[:, -1]])    # Casting the target vector as a 2D vector of ndarrays

    # create the model based on the season (i.e. number of columns and contestants)
    model = make_model(n_cols, n_contestants)
    
    
    
    # Fitting the model to the season's data
    print("\nSeason %s" % season)
    model.fit(x, y, epochs=25, batch_size=10)
    
    # Getting training accuracy only from the model 
    # We did not have enough data to get both training and testing accuracies
    scores = model.evaluate(x, y)
    accuracy.append(scores[1])



Season 9
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 10
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 11
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 12
Epoch 1/25
Epoch 2/25
Epoch 3/25


Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 13
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 14
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 15
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25


Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 16
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25

Season 18
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


# Part 3: Model Analysis:

#### Here is a description of the model architecture. Note: the final Dense layer's shape is dependent on how many contestants we have in the season. The model below represents Season 18, which had 30 contestants, and thus the model had a 30-Neuron, softmax output layer.

In [12]:
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_24 (Dense)             (None, 20)                660       
_________________________________________________________________
dense_25 (Dense)             (None, 20)                420       
_________________________________________________________________
dense_26 (Dense)             (None, 30)                630       
Total params: 1,710
Trainable params: 1,710
Non-trainable params: 0
_________________________________________________________________


#### Accuracies from the model:

Below show the seasons and the accuracy of the models associated with the seasons. It is clear that, while our accuracies are not too high, the model does better than guessing, especially because the output layer ranged from 25-30 neurons, and the likelihood choosing the correct 1-3 people out of that large of a group is incredibly small.  

In [13]:
for season, acc in zip(seasons, accuracy):
    print("Season %.0i accuracy: %.04f"% (season, acc))

Season 9 accuracy: 0.0197
Season 10 accuracy: 0.0396
Season 11 accuracy: 0.0044
Season 12 accuracy: 0.0848
Season 13 accuracy: 0.0514
Season 14 accuracy: 0.0863
Season 15 accuracy: 0.0000
Season 16 accuracy: 0.1292
Season 18 accuracy: 0.0000
