# DataDive Week 2 - Data Collection & Understanding

Use this notebook to follow along with the Week 2 session of the Data Dive!

Where you see `...` in the code blocks, you can fill it in with your own code.

First, run the cell below to import the libraries you will need:

In [None]:
import pandas as pd # used for creating and manipulating dataframes
import matplotlib.pyplot as plt # used for data visualisation

# used for web-scraping:
import requests
from bs4 import BeautifulSoup
import csv

## Data Collection

### Loading & Exploring A Dataset

Head to this link to find the example dataset to be used: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data

Once you have the csv file, make sure it is in the same directory as this notebook (if using Google Colab, upload the csv from the Files panel in the sidebar: Files -> Upload to session storage).

Feel free to use sources such as [pandas documentation](https://pandas.pydata.org/docs/user_guide/10min.html#minutes-to-pandas), or the [Datasoc Pandas guide](https://www.sheffdatasoc.org/guides/208b27c9-26a3-4b01-9f4a-00df9ac0a60b)!

In [None]:
# import the csv file into a pandas dataframe:
df = ...

In [None]:
# view the first 5 rows of the dataframe:
...

In [None]:
# list the columns in the dataframe:
...

In [None]:
# find the number of rows and columns in the dataframe (the 'shape'):
...

In [None]:
# identify the datatypes of the columns in the dataframe:
...

In [None]:
# select one column from the dataframe:
...

In [None]:
# select one categorical column, and find the frequencies of values in that column:
...

In [None]:
# is there any missing data?
...

<details>
<summary>Hints!</summary>

Import the csv using `pd.read_csv(...)`

The following functions/features might be useful: `df.head()`, `df.info()`, `df.describe()`, `df.columns`, `df.shape`, `df.dtypes`, `df['column_name'].value_counts()`, `df.isna()`...

</details>
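
To see how the hinted functions behave before using them on the real data, here is a minimal sketch on a tiny made-up table. The data, the column names and the variable names (`toy_csv`, `toy_df`) are assumptions for illustration only and are not part of the Airbnb dataset.

```python
import io
import pandas as pd

# Hypothetical toy data (not the Airbnb dataset), read from an in-memory string
toy_csv = io.StringIO(
    "name,room_type,price\n"
    "Cosy loft,Private room,80\n"
    "Big flat,Entire home/apt,150\n"
    "Spare bed,Private room,\n"
)
toy_df = pd.read_csv(toy_csv)

print(toy_df.head())   # first rows (up to 5)
print(toy_df.shape)    # (number of rows, number of columns)
print(toy_df.dtypes)   # datatype of each column

# value_counts() is most useful on a single column (a Series)
room_counts = toy_df["room_type"].value_counts()
print(room_counts)

# isna() marks missing cells; summing the result counts them per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

The same calls should work on `df` once the real csv has been loaded, with your own choice of column.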


In [None]:
# feel free to do any more exploration of the dataset! can you find some summary statistics?
# OR import another dataset you can find, and explore it!
...
...
...

### Web-scraping Example

The following code block shows an example of web-scraping from this site: https://quotes.toscrape.com, using BeautifulSoup. It works by searching the page's HTML for particular elements, so it is best suited to static webpages, where the content is already present in the HTML the server returns.

**Remember to be mindful of what data you are scraping, and whether this is legal and ethical!**

In [None]:
# the necessary libraries have already been imported
url = "https://quotes.toscrape.com"
csv_path = "bs4_scraped_quotes.csv"

response = requests.get(url, timeout=10) # a GET request to retrieve the webpage (gives up after 10 seconds)
response.raise_for_status() # stop here with an error if the request failed (e.g. a 404 or 500 response)
soup = BeautifulSoup(response.text, "html.parser") # parse the page's HTML with BeautifulSoup
quotes = soup.find_all("span", class_="text") # finds all 'span' elements with class 'text'
authors = soup.find_all("small", class_="author") # finds all 'small' elements with class 'author'

with open(csv_path, "w", encoding="utf-8-sig", newline="") as f: # create a csv file and write the data to it
    writer = csv.writer(f)
    writer.writerow(["Quote", "Author"])
    for quote, author in zip(quotes, authors):
        writer.writerow([quote.text, author.text])
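
Pairing two separate `find_all` results with `zip` relies on every quote having exactly one author in the same order. As a rough alternative sketch, you can loop over each containing element and look inside it. The HTML below is made up for illustration, and the `div` wrapper with class `quote` is an assumption, so inspect the real page before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely shaped like a page of quote blocks
sample_html = """
<div class="quote">
  <span class="text">"A made-up quote."</span>
  <small class="author">Example Author</small>
</div>
<div class="quote">
  <span class="text">"Another made-up quote."</span>
  <small class="author">Second Author</small>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for block in sample_soup.find_all("div", class_="quote"):
    text = block.find("span", class_="text")
    author = block.find("small", class_="author")
    # find() returns None when nothing matches, so check before reading the text
    if text is not None and author is not None:
        rows.append((text.get_text(strip=True), author.get_text(strip=True)))

print(rows)
```

Because each quote and author are read from the same block, a block with a missing element is skipped rather than shifting every later pair out of line.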

Now you can use the csv with pandas!

In [None]:
quotesdf = pd.read_csv(csv_path)
quotesdf.head(10)
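
One practical check before scraping a site is its `robots.txt` file, which states which paths automated clients are asked to avoid. The sketch below uses Python's built-in `urllib.robotparser` on a made-up set of rules, so no request is sent. The rules and the `example.com` URLs are assumptions for illustration, and `robots.txt` is only one signal: a site's terms of use still apply.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, parsed from a list of lines so no request is made
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = robotparser.RobotFileParser()
parser.parse(example_rules)

# can_fetch(user_agent, url) reports whether the rules allow that URL
allowed_home = parser.can_fetch("*", "https://example.com/")
allowed_private = parser.can_fetch("*", "https://example.com/private/page")
print(allowed_home, allowed_private)
```

For a real site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the site's own `robots.txt` instead of a hand-written list.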

## Data Visualisation

This follows on from the data collection task with the NY Airbnb dataset!

In [None]:
# check the dataframe is still loaded:
df.head()

Examples of different types of chart, using matplotlib (run each cell to display the chart):

In [None]:
# histogram:
plt.hist(x='number_of_reviews', data=df, bins=30)
plt.title('# of Reviews Received')
plt.show()

In [None]:
# bar chart:
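# you may have to manipulate the dataframe / aggregate the data to create the plots that you want!
# the columns are named explicitly so that this works the same way across pandas versions
borough_frequencies = df['neighbourhood_group'].value_counts().rename_axis('neighbourhood_group').reset_index(name='count')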

plt.bar(x='neighbourhood_group', height='count', data=borough_frequencies)
plt.title('# of Listings in each NY Borough')
plt.show()
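
Counting values is one kind of aggregation. Another common one is grouping rows and taking a summary statistic per group before plotting. This is a minimal sketch on made-up data. The `toy_listings` table and its `area` and `price` columns are assumptions for illustration, not the Airbnb dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy listings (not the Airbnb dataset)
toy_listings = pd.DataFrame({
    "area": ["North", "North", "South", "South", "South"],
    "price": [100, 140, 60, 80, 70],
})

# Aggregate: one row per area, holding the mean price for that area
mean_price = toy_listings.groupby("area")["price"].mean().reset_index()

plt.bar(x="area", height="price", data=mean_price)
plt.title("Mean price per area (toy data)")
plt.xlabel("Area")
plt.ylabel("Mean price")
plt.show()
```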

Your turn! Use the [matplotlib documentation](https://matplotlib.org/stable/tutorials/pyplot.html#sphx-glr-tutorials-pyplot-py) and [Datasoc's guides](https://www.sheffdatasoc.org/guides/37891bb7-e18a-4940-9c6f-3f9cf9fc7992) for much more detail.

In [None]:
# create a scatter plot for two of the numeric features of the data
...

In [None]:
# create a box plot
...

In [None]:
# create a pie chart
...

In [None]:
# create any other types of visualisation you like!
# try exporting them as images!
...
...
...

Now try to make your visualisations more exciting! Can you improve the titles, add axis labels, add subplots (multiple charts in one figure), or change the colours?

You might also want to use `seaborn`, a Python library built on top of matplotlib that makes it easier to create nicer-looking visualisations.

Use the documentation to help you!
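
As a starting point, here is a rough sketch of a figure with two subplots, axis labels, custom colours and an export to an image file. The values, the colours and the filename `toy_listings_overview.png` are arbitrary choices for illustration, so swap in columns from `df` and your own styling.

```python
import matplotlib.pyplot as plt

# Hypothetical toy values, for illustration only
prices = [60, 70, 80, 100, 140, 150]
reviews = [12, 30, 8, 25, 3, 5]

# One figure holding two charts side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.hist(prices, bins=3, color="teal")
ax_left.set_title("Distribution of price")
ax_left.set_xlabel("Price")
ax_left.set_ylabel("Number of listings")

ax_right.scatter(prices, reviews, color="darkorange")
ax_right.set_title("Reviews against price")
ax_right.set_xlabel("Price")
ax_right.set_ylabel("Number of reviews")

fig.suptitle("Toy listings overview")
fig.tight_layout()

# Export the whole figure as an image file (the filename is arbitrary)
fig.savefig("toy_listings_overview.png", dpi=150)
plt.show()
```

Working with the `ax` objects returned by `plt.subplots` keeps each chart's title and labels separate, which is harder to manage with the plain `plt.` functions once a figure holds more than one chart.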

<br>

**Want more activities?** Check out Datasoc's [Pandas Sandbox Session](https://github.com/sheffdatasoc/2024-25/tree/main/Pandas) and [Data Visualisation Sandbox Session](https://github.com/sheffdatasoc/2024-25/tree/main/Data%20Visualisation) from 24/25!