## Data Wrangling Process:

* Gather (this lesson)
* Assess
* Clean

Gathering data is the first step in data wrangling. Before gathering, we have no data, and after it, we do.

Gathering data varies from project to project. Sometimes you're just given data, or pointed to it like I've done for you throughout this course. Sometimes you need to search for the right data for your project. Sometimes the data you need isn't readily available, and you need to generate it yourself somehow. When you do find your data, it's not unusual for it to be spread across several different sources and file formats, which makes things tricky when organizing the data in your programming environment.

For these reasons and more, gathering can be tricky. In this lesson, which is likely the most technically challenging lesson of the course, you'll acquire the coding skills and general craftiness required to conquer the vast majority of gathering scenarios you'll come across in the future. This is going to be hard sometimes, and that's okay. Stick with it and don't hesitate to reach out for help.

### This lesson will be structured as follows:

* First, we'll pose a few questions.
* Then you'll explore the source of each piece of data we need to answer those questions, each piece from a different source and in a different format.
* Then you'll learn about the structure of each file format.
* Then you'll learn how to handle that file format using Python and its libraries.
* Then you'll actually gather each piece of data to later join together to create your master dataset.

Sites to Scrape Data From


https://www.rottentomatoes.com/top/bestofrt/
    
https://www.rogerebert.com/
    
https://amueller.github.io/word_cloud/

### Navigating Your Working Directory and File I/O

Before you continue on with this lesson, make sure you are comfortable working with your computer's command line interface to access files and folders, and also with reading and writing to files (i.e. part of File I/O or input/output) in Python. It can be extremely frustrating getting bogged down in these seemingly trivial topics.
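As a quick self-check on File I/O, here is a minimal sketch of writing a text file and reading it back. The file name notes.txt and the temporary directory are assumptions made for illustration only; they are not part of the lesson's data.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing in your real folders is touched
work_dir = tempfile.mkdtemp()
file_path = os.path.join(work_dir, "notes.txt")  # hypothetical file name

# Write two lines; the with-block closes the file automatically
with open(file_path, "w", encoding="utf-8") as file:
    file.write("first line\n")
    file.write("second line\n")

# Read the lines back, stripping the trailing newline from each one
with open(file_path, encoding="utf-8") as file:
    lines = [line.rstrip("\n") for line in file]

print(os.listdir(work_dir))  # names of the files in the working directory
print(lines)
```

If each step here feels familiar, you are likely ready for the file handling in the rest of the lesson.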

### Command Line

For the command line interface, here are two excellent resources that I recommend. Pick whichever suits you best:

* Navigating the Terminal: A Gentle Introduction by Marius Masalar (for Mac users) https://computers.tutsplus.com/tutorials/navigating-the-terminal-a-gentle-introduction--mac-3855

* Command Prompt - How to use the simple, basic commands by Codrut Neagu (for Windows users) https://www.digitalcitizen.life/command-prompt-how-use-basic-commands

### Rotten Tomatoes Top 100 Movies of All Time TSV File 

Note: At some companies, internal data from a database can be downloaded programmatically from file storage systems (like Google Drive), though this is often trickier than downloading a file hosted on a web page. In practice, internal files aren't often downloaded programmatically for wrangling and analysis, visualization, or modeling.

## Flat Files 

Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
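To make the definition concrete, here is a small sketch that parses a tab-delimited flat file with Python's built-in csv module. The animal records are made up for illustration, and the text lives in a string rather than a file on disk so the example runs anywhere.

```python
import csv
import io

# Made-up flat file contents: one record per line, fields separated by tabs
tsv_text = "name\tlegs\ncat\t4\nbird\t2\n"

# csv.reader splits each line into fields on the chosen delimiter
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header, *records = list(reader)

print(header)   # the field names from the first line
print(records)  # one list of fields per remaining line
```

Note that every field comes back as a string. A flat file carries no type information, so converting "4" to a number is up to whatever reads the file.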

#### Advantages of flat files include:

* They're text files and therefore human readable.
* Lightweight.
* Simple to understand.
* Software that can read/write text files is ubiquitous, like text editors.
* Great for small datasets.


#### Disadvantages of flat files, in comparison to relational databases, for example, include:

* Lack of standards.
* Data redundancy.
* Sharing data can be cumbersome.
* Not great for large datasets (see "When does small become large?" in the Cornell link in Resources).




### Quiz

Are the files pictured below flat files? Match yes or no to each file number in the following quiz.


### File #1: animals.csv

<img src="file1.png" height=400 width=400>

### File #2: animals.tsv

<img src="file2.png" height=400 width=400>

### File #3: animals.txt

<img src="file3.png" height=400 width=400>

### File #4: animals.txt

<img src="file4.png" height=400 width=400>


## QUIZ QUESTION

#### Are the files pictured above flat files?

##### Values 
    
    Yes, No 
    
#### Files 

    #1. animals.csv
    #2. animals.tsv
    #3. animals.txt
    #4. animals.txt

## Resources 


* Professor Excel: XML & ZIP: <a href="https://professor-excel.com/xml-zip-excel-file-structure/">Explore Your Excel Workbooks File Structures</a>

* Cornell: Relational Databases - <a href="https://www.cac.cornell.edu/education/Training/DataAnalysis/RelationalDatabases.pdf">Not Your Father's Flat Files</a>

## Files in Python

pandas has one main function for parsing flat files: read_csv. Here is a link to its <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">documentation</a>.

<a href="https://pandas.pydata.org/pandas-docs/stable/reference/index.html">Flat Files in Pandas</a>

    import pandas as pd

    # The file is tab-separated, so pass sep="\t" instead of the default comma
    df = pd.read_csv("bestofrt.tsv", sep="\t")

    # Show the first five rows
    df.head()

| Unnamed: 0 | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |
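The Unnamed: 0 column in this output usually means the file's first column has an empty header, which often happens when a DataFrame's index was written out along with the data. That is my assumption about bestofrt.tsv, not something the file itself confirms. Under that assumption, here is a sketch of how index_col=0 treats that column as the index instead. It uses two rows copied from the output above, held in a string rather than read from the real file.

```python
import io

import pandas as pd

# Two rows from the output above, with an empty first header field (assumed layout)
tsv_text = (
    "\tranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "0\t1\t99\tThe Wizard of Oz (1939)\t110\n"
    "1\t2\t100\tCitizen Kane (1941)\t75\n"
)

# Default behavior: the nameless first column is kept and labeled "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# index_col=0 uses that first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either form works for this lesson. The second simply avoids carrying a redundant column into later steps.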


## Web Scraping
* <a href="https://www.rottentomatoes.com/m/et_the_extraterrestrial">Rotten Tomatoes: E.T. the Extra-Terrestrial (1982)</a>
* <a href="https://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>


## Saving HTML

The two main ways to work with HTML files are:

* Saving the HTML file to your computer (using the <a href="https://2.python-requests.org//en/master/">Requests library</a>, for example) and reading that file into a <b>BeautifulSoup constructor</b>
* Reading the HTML response content directly into a BeautifulSoup constructor (again using the Requests library, for example)
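Here is a rough sketch of both approaches. The function names, the 30-second timeout, and the html.parser choice are my own assumptions rather than part of the lesson's code. The functions that touch the network are only defined, not called, and the check at the bottom uses a made-up page held in memory.

```python
import os
import tempfile

import requests
from bs4 import BeautifulSoup


def soup_from_saved_file(url, path):
    """Way 1: save the page to disk, then parse the saved file."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    with open(path, "wb") as file:
        file.write(response.content)
    with open(path, "rb") as file:
        return BeautifulSoup(file, "html.parser")


def soup_from_response(url):
    """Way 2: parse the response content directly, without saving it."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Offline check: a made-up page stands in for response.content
html = b"<html><head><title>Example Movie</title></head><body></body></html>"
path = os.path.join(tempfile.mkdtemp(), "example.html")
with open(path, "wb") as file:
    file.write(html)

with open(path, "rb") as file:
    soup_file = BeautifulSoup(file, "html.parser")   # parsed from the saved file
soup_direct = BeautifulSoup(html, "html.parser")     # parsed from bytes in memory

print(soup_file.title.string, "|", soup_direct.title.string)
```

Both routes end with the same kind of BeautifulSoup object. Saving first gives you a stable copy to inspect and re-parse if the live page changes.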

You'll learn how this Requests code works under the hood shortly in “Downloading Files from The Internet.”

<p>For this lesson, you’re going to do neither of these. I've downloaded all of the Rotten Tomatoes HTML files for you and put them in a folder called rt_html. I recommend that you open the HTML files in your preferred text editor (e.g. <a href="https://www.sublimetext.com/">Sublime</a>, which is free) to inspect the HTML for the quizzes ahead.</p>

The rt_html folder contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time as the list stood at the most recent update of this lesson. I'm giving you these historical files because the ratings will change over time and there will be inconsistencies with the recorded lesson videos. Also, a web page's HTML is known to change over time. Scraping code can break easily when web redesigns occur, which makes scraping brittle and not recommended for projects with longevity. So just use these HTML files provided to you and pretend like you saved them yourself with one of the methods described above.

## More Information

* <a href="https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01"> Towards Data Science: Ethics in Web Scraping</a>


### HTML File Structure

Quiz

With your knowledge of HTML file structure, you're going to use Beautiful Soup to extract the Audience Score metric and the number of audience ratings from each HTML file, along with the movie title as in the video above (so we have something to merge the datasets on later), then save them in a pandas DataFrame.

The Jupyter Notebook below contains template code that:

* Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
* Loops through each movie's Rotten Tomatoes HTML file in the rt_html folder.
* Opens each HTML file and passes it into a file handle called file.
* Creates a DataFrame called df by converting df_list using the pd.DataFrame constructor.

Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to df_list.
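The template steps described above can be sketched roughly as follows. This is not the notebook's actual template: a temporary folder with two made-up files stands in for rt_html, and the dictionary keys filename and n_characters are placeholders for the values you will extract yourself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the rt_html folder: two tiny made-up HTML files
folder = tempfile.mkdtemp()
for name in ("movie_a.html", "movie_b.html"):
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write("<html><head><title>" + name + "</title></head></html>")

# List of dictionaries, one per file, converted to a DataFrame once at the end
df_list = []
for movie_html in sorted(os.listdir(folder)):
    with open(os.path.join(folder, movie_html), encoding="utf-8") as file:
        html = file.read()
        # Your extraction code would go here; these values are placeholders
        df_list.append({"filename": movie_html, "n_characters": len(html)})

df = pd.DataFrame(df_list, columns=["filename", "n_characters"])
print(df)
```

Appending to a plain list and calling the constructor once avoids rebuilding the DataFrame on every iteration.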

The Beautiful Soup methods required for this task are:

* find()
* find_all()

There is an excellent tutorial on these methods (Searching the tree) in the Beautiful Soup documentation. Please consult that tutorial if you are stuck.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
  
https://stackoverflow.com/questions/28056171/how-to-build-and-fill-pandas-dataframe-from-for-loop/28058264#28058264
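To show the difference between find() and find_all() without giving away the quiz, here is a small sketch on a made-up page. The tag names and the score and fact classes are hypothetical and will not match the real Rotten Tomatoes HTML, so inspect the files in rt_html to find the right ones.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the structure and class names are assumptions for illustration
html = """
<html>
  <head><title>Example Movie (2000) - Hypothetical Site</title></head>
  <body>
    <div class="score"><span class="value">85%</span></div>
    <ul>
      <li class="fact">Runtime: 100 minutes</li>
      <li class="fact">Ratings: 1,234</li>
    </ul>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
title = soup.find("title").get_text()
score_text = soup.find("div", class_="score").find("span").get_text()

# find_all() returns a list of every matching tag
facts = [li.get_text() for li in soup.find_all("li", class_="fact")]

# Clean the extracted strings into numbers
score = int(score_text.rstrip("%"))
num_ratings = int(facts[1].split(": ")[1].replace(",", ""))

print(title, score, num_ratings)
```

Most of the work in the quiz is the same pattern: locate a tag, pull out its text, then strip the characters that keep it from being a number.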
    