# Welcome, PyLadies!

This tutorial is intended to demo a basic data mining pipeline:
1. Define your problem
2. Identify your data
3.  ,-> Explore your data ------------------,
4.  '-- Normalize and Clean your data <--'
5. Extract information

If you're a pro, feel free to modify as you go or carve your own path. Everything you need outside of open source can be found here: https://github.com/ravedawg/PyLadies/ If you share, please cite. Thanks!

In [None]:
# Make sure the following libraries work
# To troubleshoot: open the command line and check that a library is installed, e.g. "pip show numpy"
# If it is not installed, install it, e.g. "pip install pandas"
# If it is installed but the import still fails, check your paths (pip and this notebook must use the same Python)
from IPython.display import Image
import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
plt.style.use('ggplot')
import ast
from sklearn import preprocessing
from sklearn.preprocessing import scale

## 1. Define your problem: What makes a good book-to-movie? 
Seriously though, why do so many good books become terrible movies? Take this conundrum, for example:

Mystic River. Written by Dennis Lehane, starring Sean Penn and Tim Robbins, directed by Clint Eastwood. Award-winning book, and award-winning movie.

Live By Night. Written by Dennis Lehane, starring Ben Affleck and Zoe Saldana, directed by Ben Affleck. Award-winning book, sh!t movie. Even after it was adapted to film by the author himself.

I don't expect to solve all the mysteries here, but I'd like to get some general trends.

In [None]:
cwd = os.getcwd()
movie_posters = cwd+"/img/MoviePosters.png"
Image(movie_posters, width=400, height=400)

## 2. Identify your data. 

Let's use a subset of a dataset that I created using Goodreads and IMDb.

For book data, I used the Goodreads API https://www.goodreads.com/api, which I found pretty robust and easy to use, as far as APIs go.

For movie data, IMDb has a huge volume of public data. You can download it in large chunks, https://www.imdb.com/interfaces/, or go through the API like I did (http://www.omdbapi.com/).

Holy crap there are a lot of books and movies! To link the two datasets, I went to Wikipedia.

https://en.wikipedia.org/w/index.php?title=Category:American_novels_adapted_into_films&pageuntil=Burnt+Offerings+%28Marasco+novel%29#mw-pages

## 3. Explore your data.

Start by importing the data and examining its structure.

In [None]:
# Data! Data! Data!
film_adaptations = []
with open('/Users/ravenholm/Books-to-Movies.txt') as f:
    for line in f.read().splitlines():
        d = (ast.literal_eval(line))
        film_adaptations.append(d)
    
# let's take a look at the keys, and the type of each value, to get a sense of structure
for key, value in film_adaptations[0].items():
    print(key, type(value))
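
The loading loop above assumes every line of the file is one Python dict literal. Here is a minimal sketch of that parsing step, using two made-up records in place of the real file (the titles and ratings are illustrative, not taken from the dataset):

```python
import ast

# Two hypothetical lines standing in for the contents of the data file
sample_lines = [
    "{'title': 'Example Book A', 'average_rating': '3.95'}",
    "{'title': 'Example Book B', 'average_rating': '4.10'}",
]

# literal_eval only evaluates Python literals (dicts, strings, numbers, ...),
# so it will not run arbitrary code the way eval() would
records = [ast.literal_eval(line) for line in sample_lines]

for record in records:
    for key, value in record.items():
        print(key, type(value).__name__, repr(value))
```

Notice that the ratings come back as strings, not numbers. That is why later cells cast columns with `astype(float)` or `pd.to_numeric`.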

In [None]:
# Let's shift to pandas for more flexible handling and better features (like sample!).
df = pd.DataFrame(film_adaptations)

# print a row, and print the top (head) of a column
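
If you get stuck on this exercise, here is one possible approach, sketched on a tiny made-up frame (`toy` and its values are assumptions for illustration; swap in `df` and any real column name):

```python
import pandas as pd

# Hypothetical stand-in for df; the real frame has many more columns
toy = pd.DataFrame({
    "title": ["Example Book A", "Example Book B", "Example Book C"],
    "average_rating": ["3.95", "4.10", "3.20"],
})

first_row = toy.iloc[0]             # one row, returned as a Series indexed by column name
top_titles = toy["title"].head(2)   # the first two entries of a single column

print(first_row)
print(top_titles)
```

`iloc` selects by position, so `iloc[0]` is always the first row regardless of the index labels.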





In [None]:
# And let's examine some random samples and practice accessing values based on key.
for i in range(5):
    row=df.sample()
    
    print('{} was written by {} in {}, and had an average rating of {}\n'.format(
                        row['title'].values[0],
                        row['author_name'].values[0],
                        row['publication_date'].values[0][-4:],
                        row['average_rating'].values[0]))

In [None]:
# For practice, look up the goodreads id of your favorite film-adapted book
# https://www.goodreads.com/book/
# The goodreads id precedes the book title in the web address, highlighted below in blue
Image(cwd+"/img/how_to_find_gr_id.png", width=500, height=300)

In [None]:
# Access the row of the desired film adaptation

grID = '41865'  # <------ edit this variable with the goodreads ID you just looked up

print('what the goodreads_id column looks like:', df['goodreads_id'].head())

row = df.loc[df['goodreads_id'] == grID]

# add a line to display other data about the film that might tell you whether or not it was 
# a good film adaptation.
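
One way to tackle this, sketched on made-up data (the IDs, titles, and ratings below are hypothetical): filter by ID and pick the columns you care about in a single `loc` call.

```python
import pandas as pd

# Hypothetical stand-in for df; IDs and values are made up for illustration
toy = pd.DataFrame({
    "goodreads_id": ["101", "102", "103"],
    "title": ["Example Book A", "Example Book B", "Example Book C"],
    "m_imdb_Rating": ["7.9", "5.1", "6.4"],
})

# The IDs are stored as strings, so the lookup value must be a string too
lookup_id = "102"
match = toy.loc[toy["goodreads_id"] == lookup_id, ["title", "m_imdb_Rating"]]

print(match)
```

If the result comes back empty on the real data, a likely cause is comparing against an integer such as `102` instead of the string `'102'`.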






In [None]:
# What do other book ratings look like?
df['average_rating'] = df['average_rating'].astype(float)




# Plot goodreads ratings
plt.rcParams["figure.figsize"] = (18,3)
df['average_rating'].plot(kind='bar')

## 4. Normalize and clean your data.

Normalizing, cleaning, transforming... These terms are often used loosely but have very concrete meanings in practice. For more: https://www.statisticshowto.com/normalized/

If you notice errors or something missing, please please please mention it in the comments so the quality of this dataset can be improved!

In [None]:
# Make sure numeric columns are formatted as numeric
df['ratings_count'] = df['ratings_count'].astype(float)
df['m_imdb_Rating'] = pd.to_numeric(df['m_imdb_Rating'],errors='coerce')
df['m_imdb_Votes'] = pd.to_numeric(df['m_imdb_Votes'],errors='coerce')

# convert date time objects
df['publication_date'] = pd.to_datetime(df['publication_date'], errors='coerce')
df['mReleased'] = pd.to_datetime(df['mReleased'], errors='coerce')
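
A quick sketch of what `errors='coerce'` does, since it can silently hide values. The raw strings below are assumptions about what a source might contain, not values confirmed to be in this dataset:

```python
import pandas as pd

# Hypothetical raw values: a clean number, a missing marker, and a thousands separator
raw_votes = pd.Series(["350", "N/A", "1,204"])

# Anything that does not parse as a number becomes NaN instead of raising an error
coerced = pd.to_numeric(raw_votes, errors="coerce")

# Stripping the separator first lets "1,204" survive as 1204
cleaned = pd.to_numeric(raw_votes.str.replace(",", "", regex=False), errors="coerce")

print(coerced.isna().sum(), "missing before cleanup")
print(cleaned.isna().sum(), "missing after cleanup")
```

It is worth counting the NaN values after each conversion so you know how many rows were lost to formatting rather than to genuinely missing data.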


In [None]:
#Plot goodreads ratings against imdb ratings
df.plot(x='average_rating',y='m_imdb_Rating',style='o',figsize=(10,10))

In [None]:
# Let's explore the outliers





In [None]:
# Although these classics made ok movies, there are really too few book reviews for goodreads
# scores to have any credibility. Let's establish a review count threshold for quality.

df = df[df['ratings_count'].astype(float)>19] # for goodreads

# add a similar threshold for imdb ratings, then replot




# add a point to indicate your favorite adaptation to see where it falls
# (grID was set in an earlier cell)
row = df.loc[df['goodreads_id'] == grID]
plt.plot(row['average_rating'],row['m_imdb_Rating'], marker='x', markersize=15, color="blue")
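
For the IMDb side of the exercise, a sketch along these lines may help. The cutoff of 100 votes is a hypothetical value picked only for this example, and `toy` is made-up data:

```python
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    "title": ["Example Book A", "Example Book B", "Example Book C"],
    "ratings_count": [12.0, 5400.0, 880.0],
    "m_imdb_Votes": [25000.0, 40.0, 3100.0],
})

MIN_GR_RATINGS = 19    # same cutoff as the goodreads filter in the cell above
MIN_IMDB_VOTES = 100   # hypothetical cutoff, chosen only for this example

# Keep rows that clear both thresholds. Rows with NaN counts drop out too,
# because a comparison against NaN evaluates to False.
kept = toy[(toy["ratings_count"] > MIN_GR_RATINGS) & (toy["m_imdb_Votes"] > MIN_IMDB_VOTES)]

print(kept["title"].tolist())
```

Try a few cutoffs and watch how many rows survive; a threshold that throws away most of the data is probably too strict.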

In [None]:
# Before we compare columns, we need to put each numeric column on a common scale.
# scale() standardizes a column to zero mean and unit variance (a z-score), so 0 means
# "average" and positive values mean "above average".

# standardize the data
df['scaled_gr'] = scale(df['average_rating'].astype(float))

# Plot goodreads ratings
plt.rcParams["figure.figsize"] = (18,3)
df['scaled_gr'].plot(kind='bar')
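
To see what `scale` is doing under the hood, here is a small sketch on made-up ratings that compares it with the by-hand z-score formula:

```python
import numpy as np
from sklearn.preprocessing import scale

ratings = np.array([3.2, 3.8, 4.1, 4.5])  # made-up ratings for illustration

scaled = scale(ratings)

# Same result by hand: subtract the mean, divide by the population
# standard deviation (numpy's default, ddof=0)
by_hand = (ratings - ratings.mean()) / ratings.std()

print(np.allclose(scaled, by_hand))
```

After scaling, the sign of a value tells you whether it sits above or below the average, which is what the good/bad labels in the next section rely on.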

In [None]:
# Repeat for the imdb ratings




# Replot with your point and see if it changes anything
df.plot(x='scaled_gr',y='scaled_imdb',style='o',figsize=(10,10))
row = df.loc[df['goodreads_id'] == grID]

plt.plot(row['scaled_gr'],row['scaled_imdb'], marker='x', markersize=15, color="blue")
plt.ylabel('scaled imdb ratings')
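# use the title text itself; str(row['title']) would print the whole Series, index and dtype included
plt.legend(['Adaptations', ', '.join(row['title'])])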

## 5. Extract Information

Let's start by analyzing what makes a good film adaptation versus a bad one.

In [None]:
# Create a new column labeling each row by quadrant: good/bad movie and good/bad book
# ("good" means at or above the mean, since the scaled scores are centered on 0)
df.loc[((df['scaled_imdb']>=0) & (df['scaled_gr']<0)),'Adaptation_Category'] = 1 # good movie, bad book
df.loc[((df['scaled_imdb']>=0) & (df['scaled_gr']>=0)),'Adaptation_Category'] = 2 # good movie, good book
df.loc[((df['scaled_imdb']<0) & (df['scaled_gr']>=0)),'Adaptation_Category'] = 3 # bad movie, good book
df.loc[((df['scaled_imdb']<0) & (df['scaled_gr']<0)),'Adaptation_Category'] = 4 # bad movie, bad book

print(df['Adaptation_Category'])

In [None]:
# Chart out the numeric columns
df[['Adaptation_Category','ratings_count','m_imdb_Votes']].groupby("Adaptation_Category").mean()
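
Here is the same labeling and grouping on a tiny made-up frame, so you can check the logic by eye. All values are hypothetical, and the last row has a missing IMDb score on purpose:

```python
import pandas as pd

# Hypothetical scaled scores; None becomes NaN in the float column
toy = pd.DataFrame({
    "scaled_gr":    [0.8, -0.5, 1.2, -1.1, 0.3],
    "scaled_imdb":  [1.0, 0.4, -0.7, -0.2, None],
    "m_imdb_Votes": [90000.0, 1500.0, 800.0, 300.0, 50.0],
})

good_movie = toy["scaled_imdb"] >= 0
bad_movie = toy["scaled_imdb"] < 0
good_book = toy["scaled_gr"] >= 0
bad_book = toy["scaled_gr"] < 0

toy.loc[good_movie & bad_book, "Adaptation_Category"] = 1   # good movie, bad book
toy.loc[good_movie & good_book, "Adaptation_Category"] = 2  # good movie, good book
toy.loc[bad_movie & good_book, "Adaptation_Category"] = 3   # bad movie, good book
toy.loc[bad_movie & bad_book, "Adaptation_Category"] = 4    # bad movie, bad book

# count shows how many rows landed in each category; rows with no category are left out
summary = toy.groupby("Adaptation_Category")["m_imdb_Votes"].agg(["count", "mean"])
print(summary)
```

A row with a missing score matches none of the four conditions, so it gets no category and quietly disappears from the grouped table. Checking the counts alongside the means helps you notice when a category average rests on only a handful of films.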

goodreads_id <class 'str'>
title <class 'str'>
average_rating <class 'str'>
ratings_count <class 'str'>
publication_date <class 'str'>
author_name <class 'str'>
author_id <class 'str'>
wiki_book_link <class 'str'>
mTitle <class 'str'>
mRated <class 'str'>
mReleased <class 'str'>
mLength <class 'str'>
mGenre <class 'str'>
mDirector <class 'str'>
mWriters <class 'str'>
mActors <class 'str'>
mAwards <class 'str'>
m_imdb_Rating <class 'str'>
m_imdb_Votes <class 'str'>
m_imdb_ID <class 'str'>
mStudio <class 'str'>
mPlot <class 'str'>

Thanks so much for attending! Please see my github https://github.com/ravedawg/PyLadies/ to leave a comment, to follow, or to collaborate!