## HTML to DF

Convert the contents of the HTML files saved from the [Rotten Tomatoes](https://www.rottentomatoes.com) website into a pandas DataFrame.

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd

In [2]:
# Build a list of dictionaries, one per file, and convert it to a DataFrame afterwards
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        # Note: parsing every file may take ~15 seconds
        soup = BeautifulSoup(file, 'lxml')
        # Page title without the trailing ' - Rotten Tomatoes' suffix
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        # Audience score text without its last character (the percent sign)
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        # Number of audience ratings, with the thousands separators removed
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])
df.head()

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | 12 Angry Men (Twelve Angry Men) (1957) | 97 | 103672 |
| 1 | The 39 Steps (1935) | 86 | 23647 |
| 2 | The Adventures of Robin Hood (1938) | 89 | 33584 |
| 3 | All About Eve (1950) | 94 | 44564 |
| 4 | All Quiet on the Western Front (1930) | 89 | 17768 |
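
The slicing in the parsing cell assumes that every title ends with `' - Rotten Tomatoes'` and that every score ends with a single `%`. A minimal sketch of more defensive cleaning helpers follows. The function names `clean_title`, `parse_percent`, and `parse_count` are hypothetical and not part of the original notebook, and the sample strings only mimic the kind of text the parsing cell extracts.

```python
def clean_title(raw_title, suffix=' - Rotten Tomatoes'):
    """Return the title without the site suffix; leave it unchanged if the suffix is absent."""
    if raw_title.endswith(suffix):
        return raw_title[:-len(suffix)]
    return raw_title


def parse_percent(raw_score):
    """Convert a score string such as '86%' to an int, tolerating surrounding whitespace."""
    return int(raw_score.strip().rstrip('%'))


def parse_count(raw_count):
    """Convert a count string such as ' 23,647 ' to an int by dropping thousands separators."""
    return int(raw_count.strip().replace(',', ''))


# Hypothetical sample inputs shaped like the extracted page text
print(clean_title('The 39 Steps (1935) - Rotten Tomatoes'))
print(parse_percent('86%'))
print(parse_count('\n        23,647 '))
```

Unlike a fixed-length slice, `clean_title` does not truncate a title that lacks the suffix, which may help if some saved pages use a different `<title>` format.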


In [3]:
# Save the DataFrame to a pickle file
df.to_pickle("./rt_pickled.pkl")
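
As a quick sanity check on the pickling step, the sketch below writes a small DataFrame to a pickle file and reads it back. It assumes a throwaway temporary directory and a hypothetical file name `demo.pkl` rather than the notebook's `rt_pickled.pkl`; the two sample rows are taken from the output shown earlier.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the notebook output, used only for illustration
df_demo = pd.DataFrame({'title': ['The 39 Steps (1935)', 'All About Eve (1950)'],
                        'audience_score': [86, 94],
                        'number_of_audience_ratings': [23647, 44564]})

# Write to a temporary location so the real pickle file is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'demo.pkl')  # hypothetical file name
    df_demo.to_pickle(pkl_path)
    df_restored = pd.read_pickle(pkl_path)

# True when values, dtypes, and index all survive the round trip
round_trip_ok = df_demo.equals(df_restored)
print(round_trip_ok)
```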

## Code Testing
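Run the cell below to check the solution. It compares `df` with the DataFrame stored in `rt_pickled.pkl`, after sorting both by title and resetting their indexes. If an `AssertionError` is raised, the two differ and the solution is incorrect; if no error is raised, they match. Note that the previous cell writes `rt_pickled.pkl` from `df` itself, so the comparison is only a meaningful check when that file holds an independent reference solution.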

In [4]:
df_solution = pd.read_pickle('rt_pickled.pkl')
df.sort_values('title', inplace=True)
df.reset_index(inplace=True, drop=True)
df_solution.sort_values('title', inplace=True)
df_solution.reset_index(inplace=True, drop=True)
pd.testing.assert_frame_equal(df, df_solution)
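
The sort-and-reset steps matter because `os.listdir` returns files in arbitrary order, and `pd.testing.assert_frame_equal` compares rows position by position. The sketch below illustrates this with two small frames that hold the same rows in opposite order. The helper `normalize` is a hypothetical name, and the rows are copied from the output shown earlier.

```python
import pandas as pd

rows = [
    {'title': 'The 39 Steps (1935)', 'audience_score': 86, 'number_of_audience_ratings': 23647},
    {'title': 'All About Eve (1950)', 'audience_score': 94, 'number_of_audience_ratings': 44564},
]

df_a = pd.DataFrame(rows)
df_b = pd.DataFrame(rows[::-1])  # same rows, opposite order


def normalize(frame):
    """Return a copy sorted by title with a fresh 0..n-1 index (hypothetical helper)."""
    return frame.sort_values('title').reset_index(drop=True)


# Without normalizing, the row order differs and the comparison fails
try:
    pd.testing.assert_frame_equal(df_a, df_b)
    unsorted_equal = True
except AssertionError:
    unsorted_equal = False

# After normalizing, both frames line up row by row and no error is raised
pd.testing.assert_frame_equal(normalize(df_a), normalize(df_b))
print(unsorted_equal)
```

Returning a new frame from `normalize` instead of using `inplace=True` keeps the original `df` untouched, which can be convenient when the check is rerun.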

## Further reading and common problems
[Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01)

[Beautiful Soup: Parsing HTML source code](https://www.crummy.com/software/BeautifulSoup/)

[Downloading HTML pages directly from Python with Requests (Requests: HTTP for Humans™)](https://2.python-requests.org//en/master/)

[All Pandas Function References: API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)

[Beautiful Soup: Searching in a tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)

[Beautiful Soup and Unicode Problems](https://stackoverflow.com/questions/19508442/beautiful-soup-and-unicode-problems)

[Python: Removing \xa0 from string?](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string)

[Intro to HTML and CSS Free Udacity course](https://www.udacity.com/course/intro-to-html-and-css--ud001)

[Excel and XML files](https://professor-excel.com/xml-zip-excel-file-structure/)