# Exploratory Data Analysis

Now that we have our raw data, the next main goal is getting it into a form that can be ingested by a model. For this we will need three things:
1. **Good, clean data.** This is the biggest hurdle. Once we've processed the raw data into a form we can actually make use of, we will want to explore its quality. Some things to look out for are: typical length of a summary, how many null summaries or cast lists we have, what proportion have Samuel L. Jackson in the cast, prevalence and occurrence of missing data, etc.
2. **Labels.** In our case, we are looking to classify films as either "starring" Samuel L. Jackson or not. The working definition of "starring" will be that he is included in the cast. This is clearly a coarse definition: the character John Arnold (of "hold on to your butts" fame) portrayed by SLJ is hardly the main character of Jurassic Park.
3. **Features.** For now we will do only minor cleanup of the summaries from OMDb, if anything. We will save the heavier numeric feature engineering for the machine learning step, so that we can experiment with different numerical encoding techniques (bag-of-words vs. word2vec vs. ...).
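
As a rough sketch of the quality checks and the labelling rule described above, assuming a DataFrame with columns named `actors` and `plot`. The helper names `add_label` and `quality_report` and the toy frame are made up for illustration and are not part of the project code:

```python
import pandas as pd

ACTOR = 'Samuel L. Jackson'

def add_label(df, actor=ACTOR):
    """Return a copy of df with a boolean `label` column: True if `actor` is in the cast."""
    out = df.copy()
    # regex=False so the dots in the name are matched literally
    out['label'] = out['actors'].fillna('').str.contains(actor, regex=False)
    return out

def quality_report(df):
    """Collect a few basic data-quality numbers into a dict."""
    plot_words = df['plot'].dropna().str.split().str.len()
    return {
        'n_rows': len(df),
        'n_null_plot': int(df['plot'].isna().sum()),
        'n_null_actors': int(df['actors'].isna().sum()),
        'median_plot_words': float(plot_words.median()),
        'frac_positive': float(df['label'].mean()),
    }

# Toy data, for illustration only
toy = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'actors': ['Samuel L. Jackson, Someone Else', 'Someone Else', None, 'Another Person'],
    'plot': ['A short plot summary.', None, 'Two words', 'One more plot summary here.'],
})
toy = add_label(toy)
report = quality_report(toy)
```

Rows with a missing cast list are labelled `False` here. Whether they should instead be dropped is a separate decision that this sketch does not make.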


## 1. Data cleaning

In [1]:
import glob
import re

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Canonical cookie cutter start...
from dotenv import load_dotenv, find_dotenv
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)

True

In [10]:
# files = glob.glob('../data/raw/omdb-movies-20181222/*.csv')
files = glob.glob('../data/raw/omdb-movies-20190122/*.csv')
dfs = [pd.read_csv(f) for f in files]
data_raw = pd.concat(dfs)

In [11]:
# Just looking at some rows...
data_raw[data_raw['actors'].fillna('').str.contains('Samuel L. Jackson')].tail(10)

| Unnamed: 0 | imdbid | title | year | released | genre | actors | plot | response | year_mismatch |
|---|---|---|---|---|---|---|---|---|---|
| 93 | tt0113089 | Fluke | 1995 | 02 Jun 1995 | Drama, Family, Fantasy | Samuel L. Jackson, Matthew Modine, Nancy Travi... | After a man dies in a car crash, he is reincar... | True | True |
| 158 | tt0113552 | Kiss of Death | 1995 | 21 Apr 1995 | Action, Crime, Thriller | David Caruso, Samuel L. Jackson, Nicolas Cage,... | Jimmy Kilmartin's an ex-con who's trying to go... | True | True |
| 102 | tt0119080 | Eve's Bayou | 1997 | 07 Nov 1997 | Drama | Jurnee Smollett-Bell, Meagan Good, Lynn Whitfi... | The story is set in 1962 Louisiana. The Batist... | True | True |
| 145 | tt0119256 | Hard Eight | 1996 | 28 Feb 1997 | Crime, Drama | Philip Baker Hall, John C. Reilly, Gwyneth Pal... | John has lost all his money. He sits outside a... | True | False |
| 170 | tt0119396 | Jackie Brown | 1997 | 25 Dec 1997 | Crime, Drama, Thriller | Pam Grier, Samuel L. Jackson, Robert Forster, ... | The middle-aged stewardess Jackie Brown smuggl... | True | True |
| 248 | tt0118531 | One Eight Seven | 1997 | 30 Jul 1997 | Drama, Thriller | Samuel L. Jackson, John Heard, Kelly Rowan, Cl... | High school teacher Trevor Garfield is stabbed... | True | True |
| 154 | tt0116448 | The Great White Hype | 1996 | 03 May 1996 | Comedy, Sport | Samuel L. Jackson, Jeff Goldblum, Peter Berg, ... | When the champ's promoter, Reverend Sultan, de... | True | True |
| 157 | tt0119256 | Hard Eight | 1996 | 28 Feb 1997 | Crime, Drama | Philip Baker Hall, John C. Reilly, Gwyneth Pal... | John has lost all his money. He sits outside a... | True | True |
| 214 | tt0116908 | The Long Kiss Goodnight | 1996 | 11 Oct 1996 | Action, Crime, Drama, Mystery, Thriller | Geena Davis, Samuel L. Jackson, Yvonne Zima, C... | Samantha Caine, suburban homemaker, is the ide... | True | True |
| 326 | tt0117913 | A Time to Kill | 1996 | 24 Jul 1996 | Crime, Drama, Thriller | Matthew McConaughey, Sandra Bullock, Samuel L.... | In Canton, Mississippi, 10-year-old Tonya Hail... | True | True |


A few things jump out from just looking at a few examples:
1. We recorded some flags while collecting data: `response` flags whether the OMDb API returned a response, and `year_mismatch` flags a mismatch between the OMDb year of a film and the Wikipedia year.
2. There appear to be a few repeated films (e.g., "Hard Eight"). Closer inspection shows that the repeated rows differ in the `year_mismatch` flag. How did this happen?
3. Some rows do not have an `imdbid`. If memory serves, this means that we pulled a film name from Wikipedia but did not find a match on OMDb.
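
One way to investigate the repeated films from point 2 is to pull out every row whose `imdbid` occurs more than once, and then keep a single row per film. The sketch below uses a toy frame with the same column names. Keeping the first occurrence is an arbitrary choice made here for illustration, not necessarily what the final pipeline should do.

```python
import pandas as pd

# Toy data, for illustration only (the "Missed Film" rows are invented)
toy = pd.DataFrame({
    'imdbid': ['tt0119256', 'tt0119396', 'tt0119256', None, None],
    'title': ['Hard Eight', 'Jackie Brown', 'Hard Eight', 'Missed Film 1', 'Missed Film 2'],
    'year_mismatch': [False, True, True, False, False],
})

has_id = toy['imdbid'].notna()

# All rows that share an imdbid with at least one other row.
# duplicated() treats missing values as equal, hence the has_id mask.
repeats = toy[has_id & toy.duplicated(subset='imdbid', keep=False)]

# Keep one row per imdbid, but never collapse rows whose imdbid is missing
deduped = pd.concat([
    toy[has_id].drop_duplicates(subset='imdbid', keep='first'),
    toy[~has_id],
]).sort_index()
```

On the real `data_raw` the index is not unique, because `pd.concat` keeps each file's own index, so calling `reset_index(drop=True)` first would keep the final `sort_index` meaningful.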

In [12]:
# Titles with a False `response` tend to have punctuation...
missed_titles = data_raw.loc[~data_raw['response'], 'title'].tolist()
# Refactoring the data pulling mechanism only got us 10 more films
len(missed_titles)

581

Searching OMDb for these titles reveals a few recurring issues:
1. Minor differences in name, e.g. "Sergeant Deadhead" vs. "Sergeant Dead Head"
2. Stylization not used by OMDb, e.g. "Rx Murder" vs. "Prescription Murder"
3. Typos: "M"A\*S\*H" vs. "M\*A\*S\*H"
4. Subtitles: "Behemoth, the Sea Monster" vs. "Behemoth"
5. Noncanonical information: "\\$ aka Dollars" vs. "\\$"

To deal with these, we will:
1. Search for the original title
2. If not found, remove non-alphanumeric characters (keeping spaces) from the title and try again
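
The two-step lookup could be sketched as follows. The `lookup` argument is a stand-in for whatever function queries OMDb by title. It is hypothetical here, as is the `fake_lookup` used for the demonstration, and it is assumed to return `None` when nothing is found. The sketch also collapses leftover whitespace after stripping, which goes slightly beyond the steps listed above.

```python
import re

NON_ALNUM = re.compile(r'[^a-zA-Z0-9 ]+')

def search_with_fallback(title, lookup):
    """Try the original title first, then the title stripped of punctuation."""
    result = lookup(title)
    if result is not None:
        return result
    # Strip punctuation, then collapse any runs of spaces left behind
    cleaned = ' '.join(NON_ALNUM.sub('', title).split())
    if cleaned and cleaned != title:
        return lookup(cleaned)
    return None

# Hypothetical stand-in for an OMDb query: a dict of known titles
known = {'Die Monster Die': 'record-1'}

def fake_lookup(title):
    return known.get(title)

found = search_with_fallback('Die, Monster, Die!', fake_lookup)
missing = search_with_fallback('$', fake_lookup)
```

Titles that consist only of punctuation, such as "$", have nothing left after stripping, so the fallback is skipped for them and they would still need separate handling.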

In [15]:
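# Strip everything except ASCII letters, digits and spaces
pattern = re.compile(r'[^a-zA-Z0-9 ]+')
# Preview the cleaned versions of the first 20 missed titles
[pattern.sub('', t) for t in missed_titles[:20]]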

['Rx Murder',
 'The Seventh Voyage of Sinbad',
 'The Son of Robin Hood',
 'This Angry Age',
 'Gassss',
 'Monty Walsh',
 'Nams Angels',
 'R P M',
 'City Under the Sea',
 'Die Monster Die',
 'Faster Pussycat Kill Kill',
 'Sergeant Deadhead',
 'aka Dollars',
 'Bless the Beasts and Children',
 'Cains Cutthroats',
 'A Town Called Bastard',
 'Who Killed Mary Whats Er Name',
 'Behemoth the Sea Monster',
 'Crime and Punishment USA',
 'Desert Desperadoes']