# üßë‚Äçüè´ Class Exercise: Real-World Data Workflow in Python
In this exercise, you will:
- Collect and load a real-world dataset.
- Explore and assess the data.
- Clean and validate the dataset.
- Handle missing values, duplicates, and outliers.
- Generate descriptive statistics.
- Create meaningful visualizations.


## Step 1: Data Collection
We will use the **NYC Airbnb Open Data** (2019 snapshot). This dataset contains information about Airbnb listings in New York City.

👉 Dataset URL: [AB_NYC_2019.csv](https://raw.githubusercontent.com/datasciencedojo/datasets/master/AB_NYC_2019.csv)


In [None]:

import pandas as pd

# Download the dataset straight into a DataFrame
url = "https://raw.githubusercontent.com/pjournal/boun01g-data-mine-r-s/gh-pages/Assignment/AB_NYC_2019.csv"
df = pd.read_csv(url)

print("Data loaded successfully!")
print("Shape:", df.shape)
df.head()


Data loaded successfully!
Shape: (48895, 16)


|  | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |
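
The preview shows `last_review` stored as date strings and left empty for listings without reviews. A minimal sketch of parsing that column at load time, assuming you want real datetimes rather than strings. It uses three rows copied from the preview and inlined so it runs offline; `sample_csv` and `sample` are illustrative names, not part of the exercise.

```python
import io

import pandas as pd

# Three rows copied from the preview, inlined so the sketch runs without a download.
sample_csv = """id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
"""

# parse_dates converts last_review to datetime64; empty cells become NaT.
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["last_review"])

print(sample.dtypes[["price", "last_review", "reviews_per_month"]])
print(sample["last_review"].isna().sum(), "listing(s) without a review date")
```

The same `parse_dates=["last_review"]` argument can be passed to the `pd.read_csv(url)` call if you plan to work with review dates later.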


## Step 2: Initial Exploration (EDA - Part 1)
Let's start by examining the dataset structure, datatypes, and missing values.

In [None]:

# Dataset info
df.info()

# Summary of missing values
df.isnull().sum()


In [None]:

# Check numerical stats
df.describe()


In [None]:

# Check categorical stats
df.describe(include="object")
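
Raw missing-value counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a small made-up frame (`toy` and its values are hypothetical stand-ins for the Airbnb data); the same two expressions should work on `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Airbnb listings.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Brownstone"],
    "price": [120, 80, 95, 200],
    "reviews_per_month": [0.5, None, None, 1.2],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```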


## Step 3: Data Cleaning
Here we will:
- Handle missing values.
- Remove duplicates.
- Deal with outliers.


In [None]:

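# Fill missing reviews_per_month with 0 (e.g. listings that have no reviews).
# Assign the result back: calling fillna(inplace=True) on a selected column
# raises a FutureWarning in recent pandas and may not update df.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)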

# Drop rows with missing 'name' or 'host_name'
df.dropna(subset=["name", "host_name"], inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Handle outliers in 'price': keep listings at or below the 99th percentile
upper_limit = df["price"].quantile(0.99)
df = df[df["price"] <= upper_limit].copy()

print("Cleaned dataset shape:", df.shape)
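
The 99th-percentile cut is one of several ways to trim extreme prices. As a point of comparison, and not the method this exercise uses, the sketch below applies Tukey's IQR rule to a handful of made-up prices (`prices` is hypothetical data). The 1.5 multiplier is the conventional choice, not something derived from this dataset.

```python
import pandas as pd

# Hypothetical nightly prices; the last value is an obvious outlier.
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 10000], name="price")

# Interquartile range: the spread of the middle 50% of values.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Anything above Q3 + 1.5 * IQR is treated as an outlier.
upper_fence = q3 + 1.5 * iqr

kept = prices[prices <= upper_fence]
print(f"upper fence: {upper_fence}, dropped {len(prices) - len(kept)} value(s)")
```

On heavily right-skewed data such as prices, the IQR rule tends to remove more rows than a 99th-percentile cut, so compare the resulting shapes before choosing one.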


## Step 4: Data Validation
Let's perform sanity checks to ensure our dataset is valid.

In [None]:

# Sanity checks
assert df["price"].min() >= 0, "Negative prices found!"
assert df["minimum_nights"].min() >= 1, "Invalid minimum nights found!"
print("Validation checks passed ✅")
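
An `assert` stops at the first failure, which hides any later problems. One possible variation, sketched below, collects every failed check into a list instead. `validate_listings` is a hypothetical helper, and the latitude and duplicate-id checks are extra assumptions about what "valid" means here, not rules from the exercise.

```python
import pandas as pd


def validate_listings(frame: pd.DataFrame) -> list:
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    if (frame["price"] < 0).any():
        problems.append("negative price")
    if (frame["minimum_nights"] < 1).any():
        problems.append("minimum_nights below 1")
    # Assumed extra checks, not part of the original exercise.
    if not frame["latitude"].between(-90, 90).all():
        problems.append("latitude out of range")
    if frame["id"].duplicated().any():
        problems.append("duplicate listing id")
    return problems


# Hypothetical toy frames: one clean, one with two deliberate problems.
good = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [100, 0, 250],
    "minimum_nights": [1, 3, 30],
    "latitude": [40.6, 40.7, 40.8],
})
bad = good.assign(price=[100, -5, 250], id=[1, 1, 3])

print(validate_listings(good))
print(validate_listings(bad))
```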


## Step 5: Exploratory Data Analysis (EDA - Part 2)
Let's dig deeper into the dataset and look at patterns and distributions.

In [None]:

# Average price by neighbourhood group (borough)
avg_price = df.groupby("neighbourhood_group")["price"].mean().sort_values(ascending=False)
avg_price


In [None]:

# Count of room types
df["room_type"].value_counts()
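
A mean on its own can be pulled upward by a few expensive listings, so it helps to see the median and the group size alongside it. The sketch below uses pandas named aggregation on a made-up frame (`toy` is hypothetical data, with column names matching the dataset).

```python
import pandas as pd

# Hypothetical toy listings using the dataset's column names.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Private room", "Entire home/apt"],
    "price": [200, 100, 60, 80, 150],
})

# One row per (borough, room type) with count, mean, and median price.
summary = (
    toy.groupby(["neighbourhood_group", "room_type"])["price"]
    .agg(listings="count", mean_price="mean", median_price="median")
    .reset_index()
)

print(summary)
```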


## Step 6: Visualization
Now let's visualize key aspects of the data.

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Price distribution
plt.figure(figsize=(8,5))
sns.histplot(df["price"], bins=50, kde=True)
plt.title("Distribution of Airbnb Prices")
plt.xlabel("Price")
plt.show()


In [None]:

# Listings by borough
plt.figure(figsize=(8,5))
sns.countplot(x="neighbourhood_group", data=df, order=df["neighbourhood_group"].value_counts().index)
plt.title("Number of Listings by Borough")
plt.show()


In [None]:

# Scatter plot: price vs number of reviews
plt.figure(figsize=(8,5))
sns.scatterplot(x="number_of_reviews", y="price", alpha=0.3, data=df)
plt.ylim(0,500)
plt.title("Price vs Number of Reviews")
plt.show()


## Step 7: Wrap-Up Activities
Try solving the following challenges:
- Find the **top 10 most expensive listings**.
- Calculate **average price per room type**.
- Visualize **room type distribution per borough**.
- Check whether any listings have **minimum nights > 365** (likely outliers) and remove them.


In [None]:
# Your code here