# Exercise - Download and Load Flat Files - STARTER

In this exercise, you will apply your skills to programmatically unzip a .zip archive and load data from a .tsv file and a .csv file into pandas dataframes.

In [8]:
#DO NOT MODIFY - imports
import pandas as pd
import zipfile

## 1. Unzip .zip file programmatically

We will load data from the Rotten Tomatoes Top 100 Movies of All Time list along with some short reviews. We've pre-gathered this dataset and stored it in the `reviews.zip` file.

For the first part of this exercise, unzip the `reviews.zip` in read mode using a context manager.
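
As a minimal sketch of this pattern, the snippet below builds a small throwaway archive in a temporary directory and then extracts it with `ZipFile` opened in read mode inside a `with` block. The names `demo.zip`, `hello.txt`, and `extracted` are hypothetical stand-ins and are not part of the exercise data.

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary directory so nothing in the workspace is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "demo.zip"    # hypothetical archive name
    target = tmp_path / "extracted"    # hypothetical output folder

    # Create a tiny archive to extract (a stand-in for reviews.zip).
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("hello.txt", "hello")

    # Open in read mode; the context manager closes the archive on exit.
    with zipfile.ZipFile(archive, "r") as zf:
        members = zf.namelist()
        zf.extractall(target)

    # Read the extracted file back before the temporary directory is removed.
    extracted_text = (target / "hello.txt").read_text()

print(members)
print(extracted_text)
```

Applied to the exercise, the same `extractall` call on `reviews.zip` would write the archive's members to a folder on disk, where they can be read by path.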

In [6]:
pwd

'/workspace/Lesson 2'

In [14]:
#FILL IN - unzip the zip file in read mode using a context manager

from zipfile import ZipFile

# Path to zip file
path = '/workspace/Lesson 2/reviews.zip'

with ZipFile(path, 'r') as zip_ref:
    # list all files in the zip
    file_names = zip_ref.namelist()
    print("Files in Zip:", file_names)
    
    # There are two files. We need the second one (reviews.csv)
    reviews = file_names[1]
    
    # Read the CSV file into the DataFrame
    with zip_ref.open(reviews) as csv_file:
        df = pd.read_csv(csv_file)
        
# Display data frame
df.head(5)

Files in Zip: ['bestofrt.tsv', 'reviews.csv']


Unnamed: 0,Movie,Viewer,Review,Rating
0,The Wizard of Oz (1939),"Mark,Mary","Great movie, excellent plot!",5
1,Get Out (2017),"Tariq,Candice",Could have had better character devlopment.,3
2,The Wizard of Oz (1939),Olga,Ok.,Not Collected
3,Dunkirk (2017),"Candice,Tariq","I loved it, recommended it to all my friends!",5
4,The Jungle Book (2016),Olga,"A great movie, but I felt the plot was rushed.",4


## 2. Load the TSV file

The `reviews` folder contains a `bestofrt.tsv` file which includes the following columns:

- `ranking`: Rank of the movie
- `critic_score`: Rating
- `title`: Title of the movie
- `number_of_critic_ratings`: Number of reviews

The data has 101 rows.

Now, load the .tsv file into a pandas dataframe while:
1. Specifying the data types of the individual columns
2. Denoting the `ranking` column as the index.
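
The two requirements above can be sketched on a small in-memory sample. The two rows below are made up for illustration, and the integer and string dtypes are assumptions about the real file rather than confirmed facts.

```python
import io

import pandas as pd

# Hypothetical two-row sample using the bestofrt.tsv column names.
sample_tsv = (
    "ranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "1\t99\tExample Movie A\t110\n"
    "2\t98\tExample Movie B\t95\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",                  # TSV files are tab-delimited
    dtype={                    # assumed types, for illustration only
        "ranking": "int64",
        "critic_score": "int64",
        "title": "string",
        "number_of_critic_ratings": "int64",
    },
    index_col="ranking",       # use the ranking column as the index
)

print(sample_df.dtypes)
print(sample_df.index.name)
```

Passing `dtype` explicitly keeps pandas from guessing column types, and `index_col` removes `ranking` from the regular columns and uses it as the row labels.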

In [23]:
#FILL IN - load a tsv into a pandas dataframe

from zipfile import ZipFile

# Path to the zip file
path = '/workspace/Lesson 2/reviews.zip'

# Open the zip file
with ZipFile(path, 'r') as zip_ref:
    # List all files in the zip
    file_names = zip_ref.namelist()
    print("Files in Zip:", file_names)

    # Select the TSV by name rather than by its position in the archive
    tsv_name = 'bestofrt.tsv'
    if tsv_name not in file_names:
        raise ValueError(f"Expected {tsv_name} in the zip archive.")

    # Read the TSV file into the DataFrame
    with zip_ref.open(tsv_name) as tsv_file:
        df = pd.read_csv(tsv_file,
                         sep='\t',  # TSV files are tab-delimited
                         # Assumed dtypes; adjust if a column has missing or non-integer values
                         dtype={'ranking': 'int64',
                                'critic_score': 'int64',
                                'title': 'string',
                                'number_of_critic_ratings': 'int64'},
                         index_col='ranking')  # Use ranking as the index

Files in Zip: ['bestofrt.tsv', 'reviews.csv']


In [19]:
#FILL IN - show the dataframe
# Display the first 5 rows of the DataFrame
df.head(5)


## 3. Load the CSV file

We've also included a review dataset, `reviews.csv`, in the folder. It consists of synthetic data about individual viewers who wrote short reviews and rated the movies.

Now load the .csv file into a dataframe **while** doing the following:
- Marking the 'Not Collected' values as NaNs
- Defining the header as the first (0th) row of the .csv
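
A minimal sketch of both options on an in-memory sample follows. The rows and viewer names are hypothetical, and only a subset of the `reviews.csv` columns is used.

```python
import io

import pandas as pd

# Hypothetical sample in the shape of reviews.csv.
sample_csv = (
    "Movie,Viewer,Rating\n"
    "Example Movie A,Ana,5\n"
    "Example Movie B,Ben,Not Collected\n"
)

sample_reviews = pd.read_csv(
    io.StringIO(sample_csv),
    header=0,                   # the first row holds the column names
    na_values="Not Collected",  # treat this sentinel string as missing
)

print(sample_reviews)
print(sample_reviews["Rating"].isna().sum())
```

Because NaN is a floating-point value, a column that mixes whole numbers with missing entries is read as `float64`, which is why a rating of 5 displays as `5.0`.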

In [27]:
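#FILL IN - load a csv into a pandas dataframe
df = pd.read_csv('/workspace/Lesson 2/reviews/reviews.csv',
                 sep=',',
                 header=0,  # first row holds the column names
                 # Keys must match the CSV's column names exactly (case-sensitive)
                 dtype={'Movie': 'string',
                        'Viewer': 'string',
                        'Review': 'string',
                        'Rating': 'float64'},  # float so missing ratings can be NaN
                 na_values='Not Collected')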

In [28]:
#FILL IN - show the dataframe
df.head(10)

Unnamed: 0,Movie,Viewer,Review,Rating
0,The Wizard of Oz (1939),"Mark,Mary","Great movie, excellent plot!",5.0
1,Get Out (2017),"Tariq,Candice",Could have had better character devlopment.,3.0
2,The Wizard of Oz (1939),Olga,Ok.,
3,Dunkirk (2017),"Candice,Tariq","I loved it, recommended it to all my friends!",5.0
4,The Jungle Book (2016),Olga,"A great movie, but I felt the plot was rushed.",4.0
5,High Noon (1952),Aaron,Will not watch again.,1.0
6,Get Out (2017),Olga,Loved it!,
7,The Wizard of Oz (1939),Aaron,Timeless!,
