
### Scraping Box Office Data Using BeautifulSoup and Requests

This Python script automates extracting box office data from BoxOfficeMojo's weekend chart. It fetches the page with the `requests` library and parses the HTML with `BeautifulSoup` to isolate the box office table. The script then iterates over the table rows, capturing fields such as rank, release, and gross earnings for each movie, and saves the structured result to a CSV file named `box_office_data.csv`. The CSV can then be loaded for analysis or reporting.


In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
url = "https://www.boxofficemojo.com/weekend/2024W07/?ref_=bo_we_nav"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Use the correct class or id for the table
table = soup.find('table', {'class': 'mojo-body-table'})

data = []

if table:
    rows = table.find_all('tr')[1:]  # Assuming the first row is the header

    for row in rows:
        cells = row.find_all('td')
        if len(cells) >= 9:  # Ensure there are enough cells
            entry = {
                'rank': cells[0].text.strip(),
                'release': cells[1].text.strip(),
                'Title': cells[2].text.strip(),
                'lw': cells[3].text.strip(),
                'percentage': cells[4].text.strip(),
                'change': cells[5].text.strip(),
                'average': cells[6].text.strip(),
                'gross': cells[7].text.strip(),
                'Total Gross': cells[8].text.strip(),
                'distributor': cells[9].text.strip() if len(cells) > 9 else ''
            }
            data.append(entry)

    if data:
        csv_file = "box_office_data.csv"
        with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
        print(f"Data saved to {csv_file}")
    else:
        print("No data extracted from the table.")
else:
    print("Table not found in the page.")
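
The CSV-writing step can be checked without touching the network. The sketch below assumes two made-up rows (the titles and figures are placeholders, not scraped values). It writes the same kind of dictionaries to an in-memory buffer with `csv.DictWriter` and reads them back with `csv.DictReader`.

```python
import csv
import io

# Hypothetical rows shaped like the scraped entries (placeholder values).
rows = [
    {"rank": "1", "Title": "Movie A", "Total Gross": "$27,900,000"},
    {"rank": "2", "Title": "Movie B", "Total Gross": "$5,400,000"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read the CSV text back to confirm the round trip preserves each row.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == rows)
```

Values that contain commas, such as `$27,900,000`, are quoted automatically by the writer, so they survive the round trip as single fields.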


In [None]:
# Read the CSV from the working directory, where the previous cell saved it
df = pd.read_csv('box_office_data.csv')
df.head()


# Choose only 6 movies
For this assignment, we do not need all the movies; we only need the first 6. You can do this by creating a new dataset, let's call it `data`: `data = df.head(6)`.

In [None]:
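# Create the new DataFrame and call it data (see the hint above).
# .copy() makes data independent of df, so adding columns later does not trigger SettingWithCopyWarning.
data = df.head(6).copy()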
print(data)




### Let's clean our data by removing the $ signs and commas. We create a new column called `Total Gross Cleaned`, which you will use in your analysis.

In [None]:


# Strip "$" and "," from 'Total Gross', then convert the result to float
data['Total Gross Cleaned'] = data['Total Gross'].str.replace(r'[$,]', '', regex=True).astype(float)
print(data)
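
To see what the regular expression does on its own, here is a small sketch using only the standard library. The helper name `clean_currency` and the sample strings are assumptions made for illustration; the notebook itself uses the pandas `.str.replace` call above.

```python
import re


def clean_currency(text: str) -> float:
    """Strip dollar signs and commas, then convert to float."""
    return float(re.sub(r"[$,]", "", text))


# Placeholder strings for illustration only.
samples = ["$27,900,000", "$5,400,000", "1,250"]
cleaned = [clean_currency(s) for s in samples]
print(cleaned)  # [27900000.0, 5400000.0, 1250.0]
```

Inside a character class, `$` is a literal dollar sign, so it needs no backslash.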




## Task 1: Visualize Data with a Bar Graph

**Objective:** Create a bar graph to visualize the relationship between `Title` and `Weekend Gross` using the `data` DataFrame.

**Instructions:**

1. **Plotting**: Utilize matplotlib to create a bar graph that plots each movie title (`Title`) against its corresponding `Weekend Gross`.
2. **Adjustments**: Ensure the graph is readable and appropriately sized to effectively display the data.
3. **Improvements**: List any aspects of the graph that could be improved.



In [None]:
# Task 1: bar graph of the cleaned gross for each title
plt.figure(figsize=(20, 10))
plt.bar(data['Title'], data['Total Gross Cleaned'])
plt.xlabel('Movies')
plt.ylabel('Weekend Gross')
plt.show()




## List of Improvements
- The scaling on the y-axis is unclear
- The bars are not in descending or ascending order
- The graph is not visually pleasing
- Due to the scaling on the graph you can't clearly tell how much a movie grossed

## Task 2: Correct Data Sorting and Create a Horizontal Bar Chart


**Sort the Data**:
   Begin by sorting your data so it is in the correct order for visualization. Use the `sort_values` method on your DataFrame. To sort in ascending order by the `Total Gross Cleaned` column, run the following line of code:

   `sorted_data = data.sort_values(by='Total Gross Cleaned', ascending=True)`
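
As a quick illustration of how `sort_values` reorders rows, the sketch below builds a tiny DataFrame with made-up titles and figures (assumptions, not scraped data) and sorts it both ways.

```python
import pandas as pd

# Placeholder data for illustration only.
sample = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Total Gross Cleaned": [5_400_000.0, 27_900_000.0, 12_000_000.0],
})

ascending = sample.sort_values(by="Total Gross Cleaned", ascending=True)
descending = sample.sort_values(by="Total Gross Cleaned", ascending=False)

print(ascending["Title"].tolist())   # ['Movie A', 'Movie C', 'Movie B']
print(descending["Title"].tolist())  # ['Movie B', 'Movie C', 'Movie A']
```

By default, `plt.barh` draws the first row at the bottom of the axis, so an ascending sort places the largest bar at the top of a horizontal chart.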


In [None]:
# Task 2: sort the data, then draw a horizontal bar chart

# Sort in ascending order by the cleaned gross
data = data.sort_values(by='Total Gross Cleaned', ascending=True)

# plt.barh draws horizontal bars, as the task title asks
plt.figure(figsize=(20, 10))
plt.barh(data['Title'], data['Total Gross Cleaned'])
plt.show()




## Task 3

To convert the `Total Gross Cleaned` values from dollars to millions, divide each entry by 1,000,000. This makes the large numbers easier to read. After dividing, convert the results to integers for a cleaner display; note that the integer conversion drops the fractional part rather than rounding. This approach follows the [`week4.ipynb`](https://github.com/Reben80/Data110-32213/blob/bc90a812b1d18b9b2ff294ad10754ff19525160b/Week4.ipynb) notebook on GitHub, which is a useful reference for this kind of data manipulation.


In [None]:
# Task 3: convert dollars to whole millions
data['Total Gross Cleaned'] = data['Total Gross Cleaned']/1000000
data['Total Gross Cleaned'] = data['Total Gross Cleaned'].astype(int)
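
Because `astype(int)` truncates instead of rounding, a value such as 27.9 million becomes 27. The sketch below uses made-up values (assumptions for illustration) to compare truncation with rounding before the integer conversion.

```python
import pandas as pd

# Placeholder dollar amounts for illustration only.
gross = pd.Series([27_900_000.0, 5_400_000.0, 12_600_000.0])

millions = gross / 1_000_000

truncated = millions.astype(int)        # drops the fractional part
rounded = millions.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())  # [27, 5, 12]
print(rounded.tolist())    # [28, 5, 13]
```

If the chart should reflect the nearest million, rounding first may be the better choice; the notebook's approach keeps the simpler truncation.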

## Task 4
Create the final graph with the following specifications:

1. **Frameless Design**: Ensure the graph does not have an outer frame.
2. **Vertical Grid Only**: Include only vertical grid lines; remove any horizontal grid lines.
3. **Titles and Labels**: Add a meaningful title to the graph as well as labels for the X and Y axes to enhance readability.
4. **Figure Size**: Adjust the figure size to ensure it's suitable and makes the data easy to view and understand.
5. **Orientation**: The graph should be a horizontal bar chart to better display the data.

For detailed examples and guidance on implementing these features, refer to the `week4.ipynb` notebook on [GitHub](https://github.com/Reben80/Data110-32213/blob/bc90a812b1d18b9b2ff294ad10754ff19525160b/Week4.ipynb). It covers data visualization techniques, including how to adjust plot aesthetics to meet specific criteria.


In [None]:
# Task 4: final frameless horizontal bar chart
plt.figure(figsize=(10, 10))
ax = plt.gca()
ax.barh(data['Title'], data['Total Gross Cleaned'], color='red')
ax.invert_yaxis()

# Vertical grid lines only, drawn behind the bars
ax.grid(axis='x', color='gray', linewidth=0.2)
ax.set_axisbelow(True)

# Remove the outer frame and the tick marks
for side in ('top', 'left', 'right', 'bottom'):
    ax.spines[side].set_visible(False)
ax.tick_params(axis='both', which='both', length=0)

plt.xlabel('Weekend Gross (Millions $)',fontsize='x-large')
plt.title('Weekend Gross of Movies', fontsize='xx-large')
plt.ylabel('Movie', fontsize='x-large')
plt.show()
