# Lab 3.08

In this lab, you will be using the Sakila database of movie rentals.

In order to optimize our inventory, we would like to know which films will be rented next month and we are asked to create a model to predict it.

## Instructions

- Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.
- Create a query to get the list of films and a boolean indicating if it was rented last month (August 2005). This would be our target variable.
- Read the data into a Pandas dataframe.
- Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.
- Create a logistic regression model to predict this variable from the cleaned data.
- Evaluate the results.


## 1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.

This is the info I think I need: 

#### (Later realization: film_id and rental_id could have been dropped here, as I don't need them later on)

## 2. Create a query to get the list of films and a boolean indicating if it was rented last month (August 2005). This would be our target variable.

SELECT f.film_id, 
       f.title, 
       f.rental_rate,
       f.length,
       f.rating,
       c.name AS category,
       r.rental_id, 
       r.rental_date, 
       r.return_date, 
       CASE WHEN MONTH(r.rental_date) = 8 
            THEN "True"
            ELSE "False"
            END AS rented_last_month 
       FROM film f
JOIN inventory USING(film_id)
JOIN rental r USING(inventory_id)
JOIN film_category fc ON f.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id;

(Another later realization: this query checks, for every *copy* of a movie, whether it was rented last month. It would make more sense to get that information for every *movie*.)

#### HOWEVER: exploring this in MySQL Workbench showed me that for every movie, at least one copy was rented last month, so this returns one 'True' and one 'False' row for every movie. That is not very helpful for regression, to say the least.
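
One way to get one row per movie instead of one row per copy is to aggregate over all rentals of a film. The sketch below is only an illustration, not the query used in this lab: it runs against a tiny in-memory SQLite database with made-up rows (not the real Sakila data) so that it works without a MySQL server, and it replaces `MONTH()` with a date-range comparison, which also pins the year to 2005. The `LEFT JOIN`s are an assumption that films without any inventory should still show up, labelled as not rented.

```python
import sqlite3

# Hypothetical miniature of the Sakila tables, just enough to run the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY,
                     inventory_id INTEGER,
                     rental_date TEXT);
INSERT INTO film VALUES (1, 'FILM A'), (2, 'FILM B'), (3, 'FILM C');
INSERT INTO inventory VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO rental VALUES
    (100, 10, '2005-07-15 10:00:00'),
    (101, 11, '2005-08-02 09:30:00'),
    (102, 20, '2005-06-20 18:45:00');
""")

# One row per film: MAX over the per-rental flag is 1 if any copy
# of the film was rented in August 2005, else 0.
query = """
SELECT f.film_id,
       f.title,
       MAX(CASE WHEN r.rental_date >= '2005-08-01'
                 AND r.rental_date <  '2005-09-01'
                THEN 1 ELSE 0 END) AS rented_last_month
FROM film f
LEFT JOIN inventory i ON i.film_id = f.film_id
LEFT JOIN rental r ON r.inventory_id = i.inventory_id
GROUP BY f.film_id, f.title
ORDER BY f.film_id;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this toy data FILM A has one copy rented in August, FILM B was only rented in June, and FILM C has no inventory at all, so only FILM A ends up with a 1.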

## 3. Read the data into a Pandas dataframe.

In [7]:
# THIS WAS THE FIRST PASS, WHERE I DID IT WRONG


import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass
import numpy as np
password = getpass.getpass()

# get the data
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)
query = '''SELECT f.film_id, 
       f.title, 
       f.rental_rate,
       f.length,
       f.rating,
       c.name AS category,
       r.rental_id, 
       r.rental_date, 
       r.return_date, 
       CASE WHEN MONTH(r.rental_date) = 8 
            THEN "True"
            ELSE "False"
            END AS rented_last_month 
       FROM film f
JOIN inventory USING(film_id)
JOIN rental r USING(inventory_id)
JOIN film_category fc ON f.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id;'''

data = pd.read_sql_query(query, engine)
data.shape

········


(16045, 10)

In [33]:
# THIS IS THE RIGHT ONE: SAME THING, BUT WITHOUT ALL THE RENTAL DATES AND IDs, AND WITH SELECT DISTINCT
# NOTE: the CASE below tests MONTH(r.rental_date) = 5 (May), not 8 (August) as in the first pass

import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass
password = getpass.getpass()

# get the data
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)
query = '''SELECT DISTINCT f.film_id, 
       f.title, 
       f.rental_rate,
       f.length,
       f.rating,
       c.name AS category,
       CASE WHEN MONTH(r.rental_date) = 5
            THEN "True"
            ELSE "False"
            END AS rented_last_month 
       FROM film f
JOIN inventory USING(film_id)
JOIN rental r USING(inventory_id)
JOIN film_category fc ON f.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id;'''

data = pd.read_sql_query(query, engine)
data.shape

········


(1644, 7)

This gives 958 unique films, although the 'film' table holds 1000. The reason is the inner joins: 42 films in the 'film' table have no copy in 'inventory', so they drop out of the result.

## 4. Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.

In [39]:
data.tail(60)

| | film_id | title | rental_rate | length | rating | category | rented_last_month |
|---|---|---|---|---|---|---|---|
| 1584 | 294 | EXPECATIONS NATURAL | 4.99 | 138 | PG-13 | Travel | False |
| 1585 | 294 | EXPECATIONS NATURAL | 4.99 | 138 | PG-13 | Travel | True |
| 1586 | 299 | FACTORY DRAGON | 0.99 | 144 | PG-13 | Travel | True |
| 1587 | 299 | FACTORY DRAGON | 0.99 | 144 | PG-13 | Travel | False |
| 1588 | 307 | FELLOWSHIP AUTUMN | 4.99 | 77 | NC-17 | Travel | True |
| 1589 | 307 | FELLOWSHIP AUTUMN | 4.99 | 77 | NC-17 | Travel | False |
| 1590 | 339 | FROGMEN BREAKING | 0.99 | 111 | R | Travel | False |
| 1591 | 342 | FUGITIVE MAGUIRE | 4.99 | 83 | R | Travel | False |
| 1592 | 347 | GAMES BOWFINGER | 4.99 | 119 | PG-13 | Travel | True |
| 1593 | 347 | GAMES BOWFINGER | 4.99 | 119 | PG-13 | Travel | False |


In [50]:
# Now I want to clean this up: for every movie where a 'True' row exists, I want to drop the 'False' row.

# The first way that occurs to me is to build a new list of labels from all of the rows.
# This relies on the two rows of a film sitting next to each other in the dataframe.

newlist = []
for i in data.index[:-1]: # stop one short of the end, because the loop looks ahead to row i+1 (so the last row is never visited on its own)

    # The first row of a pair gets a 'True' (pairs only exist where one row is 'True', because of SELECT DISTINCT in SQL)
    if data['title'][i] == data['title'][i+1]:
        newlist.append("True")

    # The second row of a pair gets passed over - nothing gets added
    elif data['title'][i] == data['title'][i-1]:
        pass

    # A row that is neither the first nor the second of a pair keeps its own label, either 'True' or 'False':
    else:
        newlist.append(data['rented_last_month'][i])

# Number of labels collected; it should match the number of unique films
len(newlist)
        


958
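
The same collapse to one label per film can probably be done without a loop. A minimal sketch, assuming the label column holds the strings "True" and "False" as in the query above; the four films are copied from the `data.tail(60)` output, and the column name `rented` is introduced here only for illustration:

```python
import pandas as pd

# A few rows shaped like the query result (one or two rows per film).
data = pd.DataFrame({
    "film_id": [294, 294, 299, 299, 339, 342],
    "title": ["EXPECATIONS NATURAL", "EXPECATIONS NATURAL",
              "FACTORY DRAGON", "FACTORY DRAGON",
              "FROGMEN BREAKING", "FUGITIVE MAGUIRE"],
    "rented_last_month": ["False", "True", "True", "False", "False", "False"],
})

# Turn the string labels into real booleans first.
data["rented"] = data["rented_last_month"].eq("True")

# A film counts as rented if ANY of its rows is True.
films = (
    data.groupby(["film_id", "title"])["rented"]
        .any()
        .reset_index()
)
print(films)
```

Unlike the loop, this does not depend on the row order or on the last row being part of a pair.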


In [58]:
# Next, I'll remove all the duplicate titles from my data: 

data = data.drop_duplicates(subset='title')


In [59]:
# And then, set the 'newlist' with Booleans into the dataframe, as 'rented_last_month'

data['rented_last_month'] = newlist

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['rented_last_month'] = newlist
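
The warning above appears because pandas cannot tell whether the deduplicated frame is an independent object or a view on the original. A common way to avoid it, sketched here on a made-up three-row frame, is to take an explicit `.copy()` before assigning the new column (the name `deduped` is introduced only for this example):

```python
import pandas as pd

data = pd.DataFrame({
    "title": ["A", "A", "B"],
    "rented_last_month": ["False", "True", "False"],
})
newlist = ["True", "False"]  # one label per unique title

# .copy() makes the result an independent dataframe,
# so the column assignment below is unambiguous.
deduped = data.drop_duplicates(subset="title").copy()

# A plain list is assigned by position, so it needs exactly one item per row.
deduped["rented_last_month"] = newlist
print(deduped)
```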


In [10]:
data.describe()

| | film_id | rental_rate | length | rental_id |
|---|---|---|---|---|
| count | 16045.0 | 16045.0 | 16045.0 | 16045.0 |
| mean | 501.077719 | 2.942509 | 114.969274 | 8025.871611 |
| std | 288.531551 | 1.649698 | 40.10175 | 4633.066013 |
| min | 1.0 | 0.99 | 46.0 | 1.0 |
| 25% | 255.0 | 0.99 | 81.0 | 4014.0 |
| 50% | 496.0 | 2.99 | 114.0 | 8026.0 |
| 75% | 753.0 | 4.99 | 148.0 | 12038.0 |
| max | 1000.0 | 4.99 | 185.0 | 16050.0 |


In [12]:
data.dtypes

film_id                       int64
title                        object
rental_rate                 float64
length                        int64
rating                       object
category                     object
rental_id                     int64
rental_date          datetime64[ns]
return_date          datetime64[ns]
rented_last_month            object
dtype: object

In [14]:
data.isna().sum()

film_id                0
title                  0
rental_rate            0
length                 0
rating                 0
category               0
rental_id              0
rental_date            0
return_date          184
rented_last_month      0
dtype: int64
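
For the transformation step itself, a rough sketch of what encoding and scaling could look like, assuming `rating` is treated as a categorical variable and `rental_rate` and `length` as numerical ones. The four rows are copied from the `data.tail(60)` output; the scaling is a hand-written z-score, not necessarily the scaler one would pick in the end:

```python
import pandas as pd

films = pd.DataFrame({
    "rental_rate": [4.99, 0.99, 4.99, 0.99],
    "length": [138, 144, 77, 111],
    "rating": ["PG-13", "PG-13", "NC-17", "R"],
})

# One-hot encode the categorical column; drop_first avoids a redundant column.
dummies = pd.get_dummies(films["rating"], prefix="rating",
                         drop_first=True, dtype=int)

# Standardize the numerical columns to mean 0 and standard deviation 1.
numeric = films[["rental_rate", "length"]]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Feature matrix: scaled numerical columns next to the encoded ones.
X = pd.concat([scaled, dummies], axis=1)
print(X)
```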

## 5. Create a logistic regression model to predict this variable from the cleaned data.
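
This step is not filled in yet. As a placeholder, here is a minimal sketch of how a logistic regression could be fitted with scikit-learn. Everything in it is an assumption: the data are randomly generated stand-ins with the same column names as the Sakila extract, and the target is synthetic, so the resulting score says nothing about the real films.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned film data (NOT real Sakila rows).
rng = np.random.default_rng(0)
n = 200
films = pd.DataFrame({
    "rental_rate": rng.choice([0.99, 2.99, 4.99], size=n),
    "length": rng.integers(46, 186, size=n),
    "rating": rng.choice(["G", "PG", "PG-13", "R", "NC-17"], size=n),
})
# Made-up target: shorter films are "rented" more often, plus noise.
films["rented_last_month"] = films["length"] + rng.normal(0, 30, size=n) < 120

# Encode the categorical feature and separate features from the target.
X = pd.get_dummies(films[["rental_rate", "length", "rating"]],
                   columns=["rating"], drop_first=True, dtype=int)
y = films["rented_last_month"].astype(int)

# Hold out a test set, keeping the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scale inside a pipeline so the scaler is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```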

## 6. Evaluate the results.
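
This step is also still open. Since almost every film turned out to be rented in the target month, accuracy alone would likely look better than the model really is, so a confusion matrix and a majority-class baseline seem worth checking too. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not results from this lab):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = rented, 0 = not rented).
y_true = [1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Rows are the true classes (0, 1), columns the predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)

# Accuracy of always predicting the most frequent class.
share_rented = sum(y_true) / len(y_true)
baseline = max(share_rented, 1 - share_rented)

print(accuracy, baseline)
print(cm)
```

Here the model reaches 0.75 accuracy against a baseline of 0.625, and the confusion matrix shows that two of the three unrented films were predicted as rented.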