# Studio: Working with Databases in Python

For today's studio, we will be using the [TV Shows dataset](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney) from Kaggle. We have already downloaded the CSV for you.

You will use the watchlist you created to answer these questions:

1. **Which streaming services contain the shows you want to watch next?**
2. **Which streaming service is the best value based on the shows you want to watch?**

As you complete the different tasks in the studio, you may choose between using Pandas or SQL. 

**Remember**: we learned in our prep work that one is oftentimes more efficient at certain tasks than the other, so choose wisely!

## My Watchlist

If you would like, please use this space to make note of your watchlist by editing the text cell. You will need 10 shows overall.

1. Money Heist
2. Max Steel
3. Frontier
4. Big Boss
5. Pandian Stores
6. Spaced Out
7. Doc McStuffins
8. The Slap
9. My Favorite Martian
10. Community


## Database Setup

Import the necessary libraries and create a dataframe from the provided CSV. 

Print the info out for the dataframe. 

After that, you may drop the column called `Unnamed: 0` and rename any columns with spaces or unusual characters in the names such as `"Disney+"`. 

Print out the info for the dataframe again to ensure your changes were made.

In [8]:
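# Code here
import pandas as pd
import sqlite3

# Load the provided CSV and inspect the raw columns
tv_shows_df = pd.read_csv("tv_shows.csv")
tv_shows_df.info()

# Drop the leftover index column and rename `Disney+`, then confirm both changes.
# `Prime Video` keeps its space and is quoted wherever it is used later.
tv_shows_df.drop('Unnamed: 0', axis=1, inplace=True)
tv_shows_df.rename(columns={'Disney+': 'Disney'}, inplace=True)
tv_shows_df.info()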

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5368 entries, 0 to 5367
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       5368 non-null   int64 
 1   ID               5368 non-null   int64 
 2   Title            5368 non-null   object
 3   Year             5368 non-null   int64 
 4   Age              3241 non-null   object
 5   IMDb             4406 non-null   object
 6   Rotten Tomatoes  5368 non-null   object
 7   Netflix          5368 non-null   int64 
 8   Hulu             5368 non-null   int64 
 9   Prime Video      5368 non-null   int64 
 10  Disney+          5368 non-null   int64 
 11  Type             5368 non-null   int64 
dtypes: int64(8), object(4)
memory usage: 503.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5368 entries, 0 to 5367
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID       
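
As a small illustration of the clean-up step, the sketch below drops the leftover index column and normalizes awkward column names in one pass. It uses a made-up two-row frame (`toy_df` and its values are hypothetical, not the Kaggle data), and it replaces spaces with underscores, which is one possible convention; the notebook itself keeps `Prime Video` as-is.

```python
import pandas as pd

# Toy frame with the same awkward column names as the CSV (values are made up).
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Title": ["Show A", "Show B"],
    "Prime Video": [1, 0],
    "Disney+": [0, 1],
})

# Drop the leftover index column.
toy_df = toy_df.drop(columns=["Unnamed: 0"])

# Remove "+" and turn spaces into underscores so names are easy to use in SQL and query().
toy_df.columns = (
    toy_df.columns
    .str.replace("+", "", regex=False)
    .str.replace(" ", "_", regex=False)
)

print(list(toy_df.columns))
```

With names like these, later queries would not need backticks or double quotes around the column.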

With your dataframe at the ready, create a new database called `tv.db`. 

Add a new table to your database called `shows` using the data in the dataframe. 

In [9]:
# Code here
# Connect to tv.db (created if it does not exist) and write the dataframe to a `shows` table
tv_db = sqlite3.connect("tv.db")
tv_shows_df.to_sql('shows', tv_db, if_exists='replace', index=False)




5368
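
The number printed under the cell appears to be the row count returned by `to_sql` in recent pandas versions. A quick way to double-check a write, sketched here against an in-memory database with a made-up three-row frame (`toy_df` and `conn` are hypothetical names), is to count the rows back out with SQL:

```python
import sqlite3
import pandas as pd

# Made-up stand-in for the real dataframe.
toy_df = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Netflix": [1, 0, 1],
})

# ":memory:" keeps this example from touching tv.db on disk.
conn = sqlite3.connect(":memory:")
toy_df.to_sql("shows", conn, if_exists="replace", index=False)

# The table should hold exactly as many rows as the dataframe.
row_count = conn.execute("SELECT COUNT(*) FROM shows").fetchone()[0]
print(row_count)
conn.close()
```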

With your new table and database set up, print out the top 20 records in the `shows` table.

In [10]:
# Code Here
# Alternative: loop over tv_db.execute(query) and print each row.
# Reading into a dataframe gives a nicer display in the notebook.
query = "SELECT * FROM shows LIMIT 20"
top_20_rows = pd.read_sql(query, tv_db)
top_20_rows

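|   | ID | Title | Year | Age | IMDb | Rotten Tomatoes | Netflix | Hulu | Prime Video | Disney | Type |
|---|----|-------|------|-----|------|-----------------|---------|------|-------------|--------|------|
| 0 | 1 | Breaking Bad | 2008 | 18+ | 9.4/10 | 100/100 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | Stranger Things | 2016 | 16+ | 8.7/10 | 96/100 | 1 | 0 | 0 | 0 | 1 |
| 2 | 3 | Attack on Titan | 2013 | 18+ | 9.0/10 | 95/100 | 1 | 1 | 0 | 0 | 1 |
| 3 | 4 | Better Call Saul | 2015 | 18+ | 8.8/10 | 94/100 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | Dark | 2017 | 16+ | 8.8/10 | 93/100 | 1 | 0 | 0 | 0 | 1 |
| 5 | 6 | Avatar: The Last Airbender | 2005 | 7+ | 9.3/10 | 93/100 | 1 | 0 | 1 | 0 | 1 |
| 6 | 7 | Peaky Blinders | 2013 | 18+ | 8.8/10 | 93/100 | 1 | 0 | 0 | 0 | 1 |
| 7 | 8 | The Walking Dead | 2010 | 18+ | 8.2/10 | 93/100 | 1 | 0 | 0 | 0 | 1 |
| 8 | 9 | Black Mirror | 2011 | 18+ | 8.8/10 | 92/100 | 1 | 0 | 0 | 0 | 1 |
| 9 | 10 | The Queen's Gambit | 2020 | 18+ | 8.6/10 | 92/100 | 1 | 0 | 0 | 0 | 1 |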


Now, create a new table called `watchlist` that has three fields:
1. id -> data type of `INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT`
2. title -> data type of `TEXT`
3. importance_rank -> data type of `INTEGER`

For the `importance_rank` field, rank each of your watchlist shows based on how much you want to see them, `10` being the most important and `1` being the least important.

Then, insert each of the items from your watchlist into the new `watchlist` table, using the `executemany` method from our exercises.

Finally, select all the records from the `watchlist` table and print them out to the console.

In [16]:
# Code here
with tv_db:
    tv_db.executescript("""
         BEGIN;
         DROP TABLE IF EXISTS watchlist;
         CREATE TABLE watchlist(id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT, Title TEXT, importance_rank INTEGER);
         COMMIT;
    """)

# Parameterized insert; each tuple in `data` fills the two placeholders
sql = 'INSERT INTO watchlist (Title, importance_rank) VALUES (?, ?)'
data = [
    ('Money Heist', 10),
    ('Max Steel', 9),
    ('Frontier', 8),
    ('Big Boss', 7),
    ('Pandian Stores', 6),
    ('Spaced Out', 5),
    ('Doc McStuffins', 4),
    ('The Slap', 3),
    ('My Favorite Martian', 2),
    ('Community', 1)
]
with tv_db:
    tv_db.executemany(sql, data)

# Read the watchlist back to confirm all 10 rows were inserted
query = "SELECT * FROM watchlist"
watchlist_items = pd.read_sql(query, tv_db)
watchlist_items

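|   | id | Title | importance_rank |
|---|----|-------|-----------------|
| 0 | 1 | Money Heist | 10 |
| 1 | 2 | Max Steel | 9 |
| 2 | 3 | Frontier | 8 |
| 3 | 4 | Big Boss | 7 |
| 4 | 5 | Pandian Stores | 6 |
| 5 | 6 | Spaced Out | 5 |
| 6 | 7 | Doc McStuffins | 4 |
| 7 | 8 | The Slap | 3 |
| 8 | 9 | My Favorite Martian | 2 |
| 9 | 10 | Community | 1 |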


## Working with the Data

Using Pandas or SQL, find the answer to these 2 questions:
1. How many of the total shows (full csv list) are on each streaming service?
2. What percentage of these total shows is available on each streaming service?

**Hint**:

Use the pandas `query` method to filter the data, and then the Python `len` function to find its length. [Relevant Link](https://www.geeksforgeeks.org/ways-to-filter-pandas-dataframe-by-column-values/)
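
A minimal sketch of the hint, using a made-up four-row frame rather than the real dataset (`toy_df` and its values are hypothetical). Note the backticks that `query` needs around a column name containing a space:

```python
import pandas as pd

# Made-up availability flags: 1 means the show is on that service.
toy_df = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C", "Show D"],
    "Netflix": [1, 0, 1, 1],
    "Prime Video": [0, 1, 0, 1],
})

# Filter with query(), then count the surviving rows with len().
netflix_count = len(toy_df.query("Netflix == 1"))
prime_count = len(toy_df.query("`Prime Video` == 1"))

# Share of all shows that are on Netflix, as a percentage.
netflix_pct = round(netflix_count / len(toy_df) * 100, 2)

print(netflix_count, prime_count, netflix_pct)
```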

In [12]:
# Code here
services = ['Netflix', 'Hulu', 'Prime Video', 'Disney']
total_shows = len(tv_shows_df)

# Total shows on each streaming service
# (backticks let query() handle the space in `Prime Video`)
show_counts = {
    service: len(tv_shows_df.query(f'`{service}` == 1'))
    for service in services
}
for service, count in show_counts.items():
    print(f"Total Shows in {service} :\n", count)

# Percentage of total shows on each streaming service
for service, count in show_counts.items():
    percentage = (count / total_shows) * 100
    print(f"Percentage of {service} shows:\n", round(percentage, 2))

Total Shows in Netflix :
 1971
Total Shows in Hulu :
 1621
Total Shows in Prime Video :
 1831
Total Shows in Disney :
 351
Percentage of Netflix shows:
 36.72
Percentage of Hulu shows:
 30.2
Percentage of Prime Video shows:
 34.11
Percentage of Disney shows:
 6.54
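
For comparison, the same kind of count can be done in SQL. Because the service columns hold 1/0 flags, `SUM` over a column counts the shows on that service. This is only a sketch against an in-memory table with made-up rows (the table contents and the `conn` name are hypothetical), not the notebook's `tv.db`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shows (Title TEXT, Netflix INTEGER, Hulu INTEGER)")
conn.executemany(
    "INSERT INTO shows VALUES (?, ?, ?)",
    [("Show A", 1, 0), ("Show B", 1, 1), ("Show C", 0, 1), ("Show D", 1, 0)],
)

# One pass over the table: total rows plus a per-service sum of the 1/0 flags.
total, netflix, hulu = conn.execute(
    "SELECT COUNT(*), SUM(Netflix), SUM(Hulu) FROM shows"
).fetchone()

netflix_pct = round(netflix / total * 100, 2)
hulu_pct = round(hulu / total * 100, 2)
print(total, netflix, hulu, netflix_pct, hulu_pct)
conn.close()
```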



Now join your `watchlist` data to the `shows` data using pandas or SQL. Verify that you joined the data correctly.

Using this related dataset, come up with analytic code that answers these questions:
1. The number of watchlist shows each streaming service has
2. The percentage of your overall watchlist each streaming service has
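
One way to verify a join in pandas, sketched here with made-up frames (`shows`, `watchlist`, and their rows are hypothetical stand-ins), is a left merge with `indicator=True`. The extra `_merge` column flags watchlist titles that found no match, which an inner join would silently drop:

```python
import pandas as pd

shows = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Netflix": [1, 0, 1],
    "Hulu": [0, 1, 1],
})
watchlist = pd.DataFrame({
    "Title": ["Show A", "Show C", "Show X"],
    "importance_rank": [3, 2, 1],
})

# Keep every watchlist row; `_merge` records whether a match was found in `shows`.
joined = watchlist.merge(shows, on="Title", how="left", indicator=True)

# Watchlist titles with no counterpart in the shows data.
missing = joined.loc[joined["_merge"] == "left_only", "Title"].tolist()

# Count and percentage for one service, measured against the whole watchlist.
matched = joined[joined["_merge"] == "both"]
netflix_count = int(matched["Netflix"].sum())
netflix_pct = round(netflix_count / len(watchlist) * 100, 2)

print(missing, netflix_count, netflix_pct)
```

Dividing by `len(watchlist)` assumes each title appears only once in both frames; duplicate titles in the shows data would inflate the counts.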


In [31]:
# Code here
# Inner join on title: watchlist shows missing from the dataset drop out of the result
with tv_db:
    cur = tv_db.cursor()
    cur.execute("""
        SELECT shows.Title,
               shows.Netflix,
               shows.Hulu,
               shows."Prime Video",
               shows.Disney
        FROM shows
        JOIN watchlist
        ON shows.Title = watchlist.Title;
    """)
    rows = cur.fetchall()

netflix_count = hulu_count = prime_count = disney_count = 0
total_count = 10  # size of the full watchlist, including shows missing from the dataset
match_count = len(rows)
print("Watchlist shows available for Streaming: ", match_count)

# Each row is (Title, Netflix, Hulu, Prime Video, Disney) with 1/0 availability flags
for row in rows:
    print(row)
    if row[1] == 1:
        netflix_count += 1
    if row[2] == 1:
        hulu_count += 1
    if row[3] == 1:
        prime_count += 1
    if row[4] == 1:
        disney_count += 1

print("Watchlist shows in Netflix     - Total", netflix_count, "with", (netflix_count/total_count*100), "%")
print("Watchlist shows in Hulu        - Total", hulu_count, "with", (hulu_count/total_count*100), "%")
print("Watchlist shows in Prime Video - Total", prime_count, "with", (prime_count/total_count*100), "%")
print("Watchlist shows in Disney      - Total", disney_count, "with", (disney_count/total_count*100), "%")

Watchlist shows available for Streaming:  8
('Community', 1, 1, 1, 0)
('Money Heist', 1, 0, 0, 0)
('Frontier', 1, 0, 0, 0)
('My Favorite Martian', 0, 1, 0, 1)
('The Slap', 0, 1, 1, 0)
('Doc McStuffins', 0, 1, 0, 1)
('Max Steel', 0, 1, 0, 0)
('Spaced Out', 0, 0, 0, 1)
Watchlist shows in Netflix     - Total 3 with 30.0 %
Watchlist shows in Hulu        - Total 5 with 50.0 %
Watchlist shows in Prime Video - Total 2 with 20.0 %
Watchlist shows in Disney      - Total 3 with 30.0 %


## Results

Now that you have done your analysis, make note of the answers to the following questions by editing the text cell:

1. Was every show on your watchlist in the Kaggle dataset? Do you have any ideas as to why a show might not have been present?

- No. Two of the shows on my watchlist, "Big Boss" and "Pandian Stores", are not in the Kaggle dataset. Both are regional Indian-language shows, which is probably why they are missing.

2. Did you include a show or shows in your watchlist that is exclusive to one of the platforms? How might that have impacted your analysis?

- Yes. "Spaced Out" is exclusive to Disney, "Max Steel" is exclusive to Hulu, and "Frontier" and "Money Heist" are exclusive to Netflix. This does not affect the analysis, since it only counts the watchlist shows on each service and their percentage of the watchlist.

3. Which streaming service(s) offered the most shows on your watchlist? Which streaming service(s) offered the least?

- Hulu offered the most, with 5 shows (50% of my watchlist). Prime Video offered the least, with 2 shows (20%).

4. Based on the shows you want to watch and the results of your analysis, is there a streaming service you think would be a good fit for you?

- Hulu would be a good fit for me, since it carries the most shows on my watchlist.

# Bonus Mission

We didn't end up using that `importance_rank` field, did we?

Well, that was intentional! 

Your bonus mission is to come up with analysis that uses that field to determine, based on watchlist show importance_rank and number of watchlist shows available on a service, which platform you should subscribe to.

In [14]:
# Code Here
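
One possible approach to the bonus mission, offered as a sketch rather than the intended solution: score each service by summing the `importance_rank` of the watchlist shows it carries, so a service with fewer but more wanted shows can outrank one with more shows. The tables below are in-memory stand-ins with made-up rows (all titles and flags are hypothetical), and "Show D" is deliberately absent from `shows` to mimic a watchlist title missing from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shows (Title TEXT, Netflix INTEGER, Hulu INTEGER);
    CREATE TABLE watchlist (
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
        Title TEXT,
        importance_rank INTEGER
    );
""")
conn.executemany(
    "INSERT INTO shows VALUES (?, ?, ?)",
    [("Show A", 1, 0), ("Show B", 0, 1), ("Show C", 0, 1)],
)
conn.executemany(
    "INSERT INTO watchlist (Title, importance_rank) VALUES (?, ?)",
    [("Show A", 4), ("Show D", 3), ("Show B", 2), ("Show C", 1)],
)

# Multiplying the 1/0 flag by the rank adds a show's rank only where it is available.
row = conn.execute("""
    SELECT SUM(shows.Netflix * watchlist.importance_rank),
           SUM(shows.Hulu * watchlist.importance_rank)
    FROM shows
    JOIN watchlist ON shows.Title = watchlist.Title
""").fetchone()
conn.close()

scores = dict(zip(["Netflix", "Hulu"], row))
best_service = max(scores, key=scores.get)
print(scores, best_service)
```

In this toy data Hulu carries two watchlist shows and Netflix only one, yet Netflix scores higher because its one show is the most wanted. Other weightings, such as dividing by subscription price, would be equally reasonable.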