# Exercise - Download and Load Flat Files - SOLUTION

In this exercise, you will programmatically unzip a .zip archive and load data from a .tsv file and a .csv file into pandas dataframes.

In [1]:
#DO NOT MODIFY - imports
import pandas as pd
import zipfile

## 1. Unzip .zip file programmatically

We will load data from the Rotten Tomatoes Top 100 Movies of All Time list along with some short reviews. We've pre-gathered this dataset and stored it in the `reviews.zip` file.

For the first part of this exercise, open `reviews.zip` in read mode using a context manager and extract its contents.

In [2]:
#FILL IN - unzip the zip file in read mode using a context manager
with zipfile.ZipFile("reviews.zip","r") as zip_ref:
    zip_ref.extractall("reviews/")
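
To check what an archive contains before or after extracting it, `ZipFile.namelist()` can be used alongside `extractall()`. The following is a minimal sketch under stated assumptions: it builds a throwaway archive named `demo.zip` (a hypothetical file, not `reviews.zip`) in a temporary directory so that it runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a throwaway directory so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    demo_zip = tmp_path / "demo.zip"  # hypothetical archive, not reviews.zip

    # Build a tiny archive with one member so there is something to inspect.
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("hello.tsv", "a\tb\n1\t2\n")

    # Read mode: list the members first, then extract them into a folder.
    with zipfile.ZipFile(demo_zip, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(tmp_path / "demo")

    # Collect the extracted file names before the temporary directory is removed.
    extracted = sorted(p.name for p in (tmp_path / "demo").iterdir())

print(members)
print(extracted)
```

The context manager closes the archive automatically, even if extraction raises an error.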

## 2. Load the TSV file

The `reviews` folder contains a `bestofrt.tsv` file which includes the following columns:

- `ranking`: Rank of the movie
- `critic_score`: Rating
- `title`: Title of the movie
- `number_of_critic_ratings`: Number of reviews

The data has 101 rows.

Now, load the .tsv file into a pandas dataframe while:
1. Specifying the data types of the individual columns
2. Denoting the `ranking` column as the index.

In [3]:
#FILL IN - load a tsv into a pandas dataframe
tomatoes = pd.read_csv('reviews/bestofrt.tsv', sep='\t',
                       dtype={'ranking': 'int',
                              'critic_score': 'int',
                              'title': 'string',
                              'number_of_critic_ratings': 'int'},
                       index_col='ranking')
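
One way to confirm that the `dtype` and `index_col` arguments took effect is to inspect `dtypes` and the index name after loading. The sketch below assumes a three-row in-memory sample (`sample_tsv`, a hypothetical stand-in that reuses the first rows shown in this exercise) instead of the real `bestofrt.tsv`, so it runs without the extracted folder.

```python
import io

import pandas as pd

# Hypothetical three-row sample with the same columns as bestofrt.tsv.
sample_tsv = (
    "ranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "1\t99\tThe Wizard of Oz (1939)\t110\n"
    "2\t100\tCitizen Kane (1941)\t75\n"
    "3\t100\tThe Third Man (1949)\t77\n"
)

# Same arguments as the exercise: tab separator, explicit dtypes, ranking as index.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    dtype={
        "ranking": "int",
        "critic_score": "int",
        "title": "string",
        "number_of_critic_ratings": "int",
    },
    index_col="ranking",
)

print(sample_df.dtypes)      # dtypes of the three non-index columns
print(sample_df.index.name)  # name of the index column
```

Because `ranking` becomes the index, it no longer appears in `dtypes`; only the three remaining columns are listed.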

In [4]:
#FILL IN - show the dataframe
tomatoes

ranking,critic_score,title,number_of_critic_ratings
1,99,The Wizard of Oz (1939),110
2,100,Citizen Kane (1941),75
3,100,The Third Man (1949),77
4,99,Get Out (2017),282
5,97,Mad Max: Fury Road (2015),370
...,...,...,...
96,100,Man on Wire (2008),156
97,97,Jaws (1975),74
98,100,Toy Story (1995),78
99,97,"The Godfather, Part II (1974)",72


## 3. Load the CSV file

We've also included a review dataset, `reviews.csv`, in the folder. It contains synthetic data about individual viewers who wrote short reviews and rated the movies.

Now load the .csv file into a dataframe **while** doing the following:
- Marking the 'Not Collected' values as NaNs
- Defining the header as the first (0th) row of the .csv

In [5]:
#FILL IN - load a csv into a pandas dataframe
individual_df = pd.read_csv('reviews/reviews.csv', 
                            na_values='Not Collected', 
                            header=0)
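
The effect of `na_values` can be seen on a small in-memory sample. This sketch assumes that the 'Not Collected' placeholder appears in the `Rating` column, which the real `reviews.csv` may or may not match; `sample_csv` is a hypothetical two-row stand-in.

```python
import io

import pandas as pd

# Hypothetical two-row sample; placing 'Not Collected' in Rating is an assumption.
sample_csv = (
    "Movie,Viewer,Review,Rating\n"
    'The Wizard of Oz (1939),"Mark,Mary","Great movie, excellent plot!",5\n'
    "The Wizard of Oz (1939),Olga,Ok.,Not Collected\n"
)

# header=0 takes column names from the first row;
# na_values turns the placeholder string into NaN while parsing.
sample_reviews = pd.read_csv(
    io.StringIO(sample_csv),
    na_values="Not Collected",
    header=0,
)

print(sample_reviews)
print(sample_reviews["Rating"].isna())
```

With the placeholder converted to NaN, the `Rating` column is parsed as a float column rather than as text, and quoted fields such as `"Mark,Mary"` stay in a single cell.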

In [6]:
#FILL IN - show the dataframe
individual_df

,Movie,Viewer,Review,Rating
0,The Wizard of Oz (1939),"Mark,Mary","Great movie, excellent plot!",5.0
1,Get Out (2017),"Tariq,Candice",Could have had better character devlopment.,3.0
2,The Wizard of Oz (1939),Olga,Ok.,
3,Dunkirk (2017),"Candice,Tariq","I loved it, recommended it to all my friends!",5.0
4,The Jungle Book (2016),Olga,"A great movie, but I felt the plot was rushed.",4.0
5,High Noon (1952),Aaron,Will not watch again.,1.0
6,Get Out (2017),Olga,Loved it!,
7,The Wizard of Oz (1939),Aaron,Timeless!,
