# Welcome, Bishop Ireton Hackers!

This tutorial leads you through the steps of a basic data mining pipeline:
1. Define your problem
2. Identify your data
3.  ,-------> Explore your data ----------
4.  '-- Normalize and Clean your data <--'
5. Extract information

If you're a pro, feel free to modify as you go or carve your own path.

In [None]:
# Make sure the following libraries work
# To troubleshoot: open the command line and check that a library is installed, e.g. "pip show numpy"
# if it is not installed, install it, e.g. "pip install pandas"
# if it is installed but the import still fails, you might need to check your paths
from IPython.display import Image
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.style.use('ggplot')
import ast
from sklearn import preprocessing
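
If you would rather check from inside Python than from the command line, the sketch below reports which of the libraries above cannot be found. The helper name `missing_packages` is my own and not part of any library; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

def missing_packages(names):
    """Return the import names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# import names used in this tutorial (scikit-learn is imported as "sklearn")
required = ["IPython", "numpy", "pandas", "matplotlib", "sklearn"]
print("Missing:", missing_packages(required))
```

An empty list means everything is importable; anything listed is a candidate for `pip install`.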


## 1. Define your problem: What makes a good book-to-movie? 
Seriously though, why do so many good books become terrible movies? Take this conundrum, for example:

Mystic River. Written by Dennis Lehane, starring Sean Penn and Tim Robbins, directed by Clint Eastwood. Award-winning book, and award-winning movie.

Live By Night. Written by Dennis Lehane, starring Ben Affleck and Zoe Saldana, directed by Ben Affleck. Award-winning book, TERRIBLE movie. Even after it was adapted to film by the author himself.

I don't expect to solve all the mysteries here, but I'd like to get some general trends.

In [None]:
cwd = os.getcwd()
movie_posters = cwd+"/img/MoviePosters.png"
Image(movie_posters, width=400, height=400)

## 2. Identify your data. 

Let's use a subset of a dataset that I created, using goodreads and imdb.

For book data, I used the goodreads API https://www.goodreads.com/api, which I found pretty robust and easy to use, as far as APIs go.

For movie data, IMDb has a huge volume of public data. You can download it in large chunks, https://www.imdb.com/interfaces/, or go through the API like I did (http://www.omdbapi.com/).

Holy crap there are a lot of books and movies! To link the two datasets, I went to wikipedia.
https://en.wikipedia.org/w/index.php?title=Category:American_novels_adapted_into_films&pageuntil=Burnt+Offerings+%28Marasco+novel%29#mw-pages


Some other great places to find free datasets:
https://www.kaggle.com/datasets
https://github.com/datasets
https://data.fivethirtyeight.com/
https://datasetsearch.research.google.com/
https://scikit-learn.org/stable/datasets/toy_dataset.html


## 3. Explore your data.

Start by importing the data and examining its structure.

In [None]:
# Data! Data! Data!
film_adaptations = []

print(cwd)

with open(cwd+'/Books-to-Movies.txt', encoding='utf8') as f:
    for line in f.read().splitlines():
        d = (ast.literal_eval(line))
        film_adaptations.append(d)
    
# let's take a look at the information each entry has (the "keys") to get a sense of structure
for key in film_adaptations[0].keys():
    print(key,type(key))
    
# We'll keep this list handy, since we'll be referencing these keys to select our data
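
Each line of the file is a Python dictionary written out as text, which is why the loop above uses `ast.literal_eval` rather than `eval`: it only accepts plain literals (strings, numbers, lists, dicts and so on) and refuses to run code. A minimal sketch, using a made-up line rather than a real entry from the file:

```python
import ast

# a made-up line in the same shape as the file (values are illustrative only)
sample_line = "{'title': 'Example Book', 'average_rating': '3.90'}"

entry = ast.literal_eval(sample_line)   # text -> dict
print(type(entry).__name__, entry['title'])

# literal_eval rejects anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as err:
    print("rejected:", type(err).__name__)
```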

In [None]:
# Transform to dataframe
# Let's shift to pandas for more flexible handling and better features (like sample!). 
# to learn more about the pandas data science toolsuite: https://pandas.pydata.org/ 
df = pd.DataFrame(film_adaptations)

# print a row, and print the top (head) of a column
print('A row:',df.sample())

print('\nA column:',df['mActors'].head())


In [None]:
# Formatting the output
# Let's examine some random samples and practice accessing values based on key.
for i in range(5):
    row=df.sample()
    
    print('{} was written by {} in {}, and had an average rating of {}\n'.format(
                        row['title'].values[0],
                        row['author_name'].values[0],
                        row['publication_date'].values[0][-4:],
                        row['average_rating'].values[0]))

In [None]:
# For practice, look up the goodreads id of your favorite film-adapted book
# https://www.goodreads.com/book/
# The goodreads id precedes the book title in the web address, highlighted below in blue
Image(cwd+"/img/how_to_find_gr_id.png", width=500, height=300)

In [None]:
# Access the row of the desired film adaptation

grID = '41865'  # <------ edit this variable with the goodreads ID you just looked up

row = df.loc[df['goodreads_id'] == grID]

# you can print the whole row, or just extract the key value you want
print('title of row',row['title']) # <---- edit this line to access a different value, like 'average_rating'
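
One thing to watch for: the IDs in this dataset appear to be stored as strings (note the quotes around `grID`), so comparing against an integer quietly matches nothing. A small sketch with a made-up two-row table; the titles and IDs are placeholders, not real entries:

```python
import pandas as pd

# toy table with made-up values, IDs stored as strings like in the tutorial data
toy = pd.DataFrame({'goodreads_id': ['101', '202'],
                    'title': ['Book A', 'Book B']})

match_str = toy.loc[toy['goodreads_id'] == '202']   # string vs string: one row
match_int = toy.loc[toy['goodreads_id'] == 202]     # string vs int: no rows

print(len(match_str), len(match_int))

# .values[0] pulls the bare value out of the one-row result
print(match_str['title'].values[0])
```

If your lookup comes back empty, check the quotes first.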


In [None]:
# What do other book ratings look like?
df['average_rating'] = df['average_rating'].astype(float)

print('goodreads ratings:\n',df['average_rating'].head())

# Plot goodreads ratings
plt.rcParams["figure.figsize"] = (18,3)
df['average_rating'].plot(kind='bar')

## 4. Normalize and clean your data.

Normalizing, cleaning, transforming... These terms are used loosely but have very concrete meanings in practice. For more: https://www.statisticshowto.com/normalized/

If you notice errors or something missing, please please please mention it in the comments so the quality of this dataset can be improved!

In [None]:
#Plot goodreads ratings (average_rating) against imdb ratings (m_imdb_Rating)
#recall all our key values are in string format, so we need to convert to numeric
df['m_imdb_Rating'] = pd.to_numeric(df['m_imdb_Rating'],errors='coerce')

df.plot(x='average_rating',y='m_imdb_Rating',style='o',figsize=(10,10))
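
The `errors='coerce'` argument matters here: any value that cannot be parsed as a number becomes `NaN` instead of raising an error, so it is worth counting how many entries were lost. A minimal sketch with made-up ratings (the `'N/A'` entry is an assumption about what a missing rating might look like, not something confirmed from the dataset):

```python
import pandas as pd

raw = pd.Series(['7.9', '6.4', 'N/A'])          # made-up values
ratings = pd.to_numeric(raw, errors='coerce')   # 'N/A' -> NaN

print(ratings)
print('unparseable entries:', ratings.isna().sum())
```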

In [None]:
# Let's explore the outliers
print('outliers:',df[df['average_rating']< 2.25])

In [None]:
# Although these classics made ok movies, there are really too few book reviews for goodreads
# scores to have any credibility. Let's establish a review count threshold for quality.

df = df[df['ratings_count'].astype(float)>19] # for goodreads


# repeat for imdb, then replot
df['m_imdb_Votes'] = pd.to_numeric(df['m_imdb_Votes'],errors='coerce')
df = df[df['m_imdb_Votes']>19]


df.plot(x='average_rating',y='m_imdb_Rating',style='o',figsize=(10,10))

# plot a marker for your favorite film adaptation on top of the other data
# use the grID variable you set up earlier
row = df.loc[df['goodreads_id'] == grID]
plt.plot(row['average_rating'],row['m_imdb_Rating'], marker='x', markersize=15, color="blue")


In [None]:
# Before we compare columns against each other, we need to put each numeric column on a
# common scale, so that every value is measured relative to the rest of its column

# normalize the data
from sklearn.preprocessing import scale
df['scaled_gr'] = scale(df['average_rating'].astype(float))

# Plot goodreads ratings AFTER normalizing
plt.rcParams["figure.figsize"] = (18,3)
df['scaled_gr'].plot(kind='bar')

# https://bruchez.blogspot.com/2017/12/having-fun-with-imdb-dataset-files.html
# lots of cool data science tutorials on scikit-learn: https://scikit-learn.org/stable/
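
To make the normalization step less of a black box: `scale` subtracts the column mean from each value and divides by the column's standard deviation, so the result has mean 0 and standard deviation 1. A score above 0 is better than average, and a score below 0 is worse. A sketch with made-up ratings that checks this by hand:

```python
import numpy as np
from sklearn.preprocessing import scale

ratings = np.array([3.2, 3.8, 4.1, 4.5])   # made-up ratings

scaled = scale(ratings)
# np.std uses the population formula by default, which is what scale uses too
by_hand = (ratings - ratings.mean()) / ratings.std()

print(scaled)
print(np.allclose(scaled, by_hand))
```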

In [None]:
# Let's compare this to what the data would have looked like BEFORE normalizing:

# Plot goodreads ratings WITHOUT normalization
plt.rcParams["figure.figsize"] = (18,3)
df['average_rating'].plot(kind='bar')


## 5. Extract Information

Let's start by analyzing what makes a good film adaptation, versus a bad one.

In [None]:
# Using the ratings information, let's derive whether it was a good adaptation or bad adaptation
# Scale the imdb ratings too, then add one more column labeling each row by its
# combination of good/bad book and good/bad movie (0 on the scaled axis is "average")
df['scaled_imdb'] = scale(df['m_imdb_Rating'].astype(float))

df.loc[((df['scaled_imdb']>=0) & (df['scaled_gr']<0)),'Adaptation_Category'] = 1 # good movie, bad book
df.loc[((df['scaled_imdb']>=0) & (df['scaled_gr']>=0)),'Adaptation_Category'] = 2 # good movie, good book
df.loc[((df['scaled_imdb']<0) & (df['scaled_gr']>=0)),'Adaptation_Category'] = 3 # bad movie, good book
df.loc[((df['scaled_imdb']<0) & (df['scaled_gr']<0)),'Adaptation_Category'] = 4 # bad movie, bad book

print(df['Adaptation_Category'])
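
The four `.loc` assignments above split the scatter plot into quadrants around the average (0 on both scaled axes). The same labeling can be sketched with `np.select`, which takes a list of conditions and a matching list of labels. The toy scores below are made up, and the text labels are my own choice in place of the numeric codes 1 to 4:

```python
import numpy as np
import pandas as pd

# made-up scaled scores, one row per quadrant
toy = pd.DataFrame({'scaled_gr':   [-1.0, 0.5, 1.2, -0.3],
                    'scaled_imdb': [0.8, 0.9, -0.4, -1.1]})

good_movie = toy['scaled_imdb'] >= 0
good_book = toy['scaled_gr'] >= 0

conditions = [good_movie & ~good_book, good_movie & good_book,
              ~good_movie & good_book, ~good_movie & ~good_book]
labels = ['good movie, bad book', 'good movie, good book',
          'bad movie, good book', 'bad movie, bad book']

toy['category'] = np.select(conditions, labels, default='unknown')
print(toy)
```

One difference to keep in mind: a row with a missing score would land in a "bad" bucket here, whereas the `.loc` version leaves it unlabeled.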

In [None]:
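# Chart out some numeric columns
# ratings_count is still stored as strings, so convert it before averaging
df['ratings_count'] = pd.to_numeric(df['ratings_count'], errors='coerce')
df[['Adaptation_Category','ratings_count','m_imdb_Votes']].groupby("Adaptation_Category").mean()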

In [None]:
from collections import Counter
# split the data into one dataframe per adaptation category
bad_books_good_movies = df[df['Adaptation_Category']==1]
good_books_good_movies = df[df['Adaptation_Category']==2]
good_books_bad_movies = df[df['Adaptation_Category']==3]
bad_books_bad_movies = df[df['Adaptation_Category']==4]

# Let's grab all the actors from bad books, but good movies (bbgm)
bbgm_Actors = []
for actors_in_a_given_movie in bad_books_good_movies['mActors'].values.tolist():    
    bbgm_Actors += actors_in_a_given_movie.split(', ')
        
print("Top actors for good film adaptations:")
            
print(Counter(bbgm_Actors).most_common(50))
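
If the split-and-count step is unclear, here it is on a tiny made-up cast list. Each movie stores its actors as one comma-separated string; `split(', ')` turns that into a list, and `Counter` tallies how often each name appears across movies.

```python
from collections import Counter

# made-up cast strings in the same "Name, Name" shape as the mActors column
casts = ['Actor A, Actor B', 'Actor B, Actor C', 'Actor B, Actor A']

all_actors = []
for cast in casts:
    all_actors += cast.split(', ')   # one string -> list of names

print(Counter(all_actors).most_common(2))
```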


In [None]:
# Let's compare this to all the actors from good books, but bad movies (gbbm)
gbbm_Actors = []
for actors_in_a_given_movie in good_books_bad_movies['mActors'].values.tolist():    
    gbbm_Actors += actors_in_a_given_movie.split(', ')
    
print("Top actors for bad film adaptations (Who ruined the book???):")

print(Counter(gbbm_Actors).most_common(50))


In [None]:
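# Any common elements?
intersection = set(bbgm_Actors).intersection(gbbm_Actors)

# every name appears only once in a set, so count appearances across both lists instead
print(Counter(a for a in bbgm_Actors + gbbm_Actors if a in intersection).most_common(10))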

Thanks so much for hacking along, Bishop Ireton! Please see my github project page https://github.com/ravedawg/HackBI to cite this work, leave a comment, follow, or collaborate!