### Load the dataset and examine the structure

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("AB_NYC_2019.csv")

# Display the first 5 rows
print(df.head())  # only the last expression in a cell is displayed, so print explicitly

# Check for missing values and data types
df.info()

### Dataset Shape, Missing Values, and Summary Statistics

In this step, we:
- Check the number of rows and columns
- Identify missing values in each column
- Review summary statistics for numerical variables such as price, reviews, and availability


In [None]:
# Shape of the dataset
print(df.shape)

# Count of missing values per column
print(df.isnull().sum())

# Summary statistics for numeric columns
print(df.describe())

### Data Cleaning and Outlier Analysis

- Missing values occur mostly in the `name`, `host_name`, `last_review`, and `reviews_per_month` columns.
- The missing `last_review` and `reviews_per_month` values correspond to listings with no reviews, so they are retained.
- Only 26 listings have prices between $5,000 and $10,000 (extreme values).
- Because these listings make up less than 0.1% of the data, they are retained rather than dropped.
- No rows are removed at this step; the dataset is ready for further analysis.
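
The statement that missing `last_review` and `reviews_per_month` values belong to listings without reviews can be checked directly. The following is a minimal sketch on a small hypothetical frame (the `toy` rows are invented for illustration; only the column names come from the dataset). On the real data, the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature of the listings table (values invented for illustration).
toy = pd.DataFrame({
    "number_of_reviews": [0, 12, 0, 3],
    "last_review": [None, "2019-05-21", None, "2018-11-02"],
    "reviews_per_month": [None, 0.8, None, 0.1],
})

# Rows where either review-related field is missing.
missing_review_info = toy["last_review"].isna() | toy["reviews_per_month"].isna()

# True only if every such row also has zero reviews.
all_unreviewed = bool((toy.loc[missing_review_info, "number_of_reviews"] == 0).all())
print(all_unreviewed)
```

If the check came back `False` on the real data, some missing values would have another cause and retaining them as-is would need a second look.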


In [None]:
# Step 3: Data Cleaning and Outlier Analysis

# Check for missing values again (just to be sure)
print("Missing values per column:")
print(df.isnull().sum())

# Summary statistics to understand data distribution
print("\nSummary statistics:")
print(df.describe())

# Handling missing values
# Columns 'name' and 'host_name' have a small number of missing entries,
# and 'last_review' and 'reviews_per_month' have about 10,000 missing entries.
# Since 'last_review' and 'reviews_per_month' are related to reviews,
# we will leave them as is for now or consider imputation later if necessary.
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df)) * 100)

# Outlier detection for price variable
high_price_count = df[(df['price'] >= 5000) & (df['price'] <= 10000)].shape[0]
print(f"\nNumber of listings priced between $5000 and $10000: {high_price_count}")

# Since only 26 listings fall in this high price range out of 48895,
# they will be retained to avoid losing potentially valuable data.
# These are likely luxury listings in NYC.

# If desired, you could optionally remove listings with price=0 or extremely high prices (e.g., >10000),
# but here we keep the dataset as is to preserve integrity.

# Summary note:
print("\nData cleaning decisions:")
print("- Minor missing values in 'name' and 'host_name' columns; can be left as is or imputed later.")
print("- Missing values in 'last_review' and 'reviews_per_month' due to no reviews; may keep as is.")
print("- Extreme price outliers are minimal (26 listings between $5000 and $10000) and retained.")


### Exploratory Data Analysis (EDA)

In this step, we will explore the data using statistical summaries and visualizations to uncover patterns, trends, and insights. Our goals include:

- Understanding the distribution of numerical features like price and number of reviews.
- Examining the most common neighborhoods and room types.
- Identifying relationships between location, price, and reviews.
- Preparing insights for interactive visualization with Folium in the next step.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style for better visuals
sns.set_theme(style="whitegrid")

# Distribution of prices (limited to values below $500 for better visualization)
plt.figure(figsize=(10,6))
sns.histplot(df[df['price'] < 500]['price'], bins=50, color='skyblue', kde=True)
plt.title("Distribution of Prices (Below $500)")
plt.xlabel("Price")
plt.ylabel("Frequency")
plt.show()


In [None]:
# Count room types
room_counts = df['room_type'].value_counts()

# Plot pie chart
plt.figure(figsize=(7,7))
plt.pie(room_counts, labels=room_counts.index, autopct='%1.1f%%', colors=sns.color_palette('pastel'))
plt.title("Room Type Distribution")
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()



In [None]:
# Top 10 most listed neighborhoods
top_neighborhoods = df['neighbourhood'].value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_neighborhoods.values, y=top_neighborhoods.index, palette='viridis')
plt.title("Top 10 Neighborhoods by Number of Listings")
plt.xlabel("Number of Listings")
plt.ylabel("Neighborhood")
plt.show()


In [None]:
# Scatter plot for price vs number of reviews (limited for visibility)
plt.figure(figsize=(10,6))
sample = df[df['price'] < 500].sample(2000, random_state=1)
sns.scatterplot(data=sample, x='price', y='number_of_reviews', hue='room_type', alpha=0.7)
plt.title("Price vs. Number of Reviews (Sampled, Price < $500)")
plt.xlabel("Price")
plt.ylabel("Number of Reviews")
plt.legend(title='Room Type')
plt.show()


### Key Insights from EDA

1. **Price Distribution**  
   - Listing prices are most heavily concentrated around $100.  
   - After $250, listing frequency significantly drops, with very few listings priced above $500.

2. **Room Type Distribution**  
   - Most listings are either **Entire home/apt** (~25,000) or **Private room** (~22,000).  
   - **Shared rooms** are rare, with only ~1,000 listings.

3. **Top Neighborhoods**  
   - **Williamsburg** and **Bedford-Stuyvesant** have the highest number of listings (3,500–4,000).  
   - **Harlem** and **Bushwick** follow, while the remaining neighborhoods in the top 10 range between 1,500–2,000 listings.

4. **Price vs. Number of Reviews**  
   - Most reviews cluster in listings priced between **$50 and $250**.  
   - **Private rooms and shared rooms** dominate lower price ranges, especially around $100.  
   - Listings above **$150** are mostly **Entire home/apartment**, showing a correlation between price and property type.
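
The relationship between price and room type in point 4 is read off a scatter plot. A grouped median is one way to put a number on it. This is a sketch on a hypothetical `toy` frame with invented prices; on the real data the `groupby` line would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample (prices invented for illustration; column names match the dataset).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room",
                  "Entire home/apt", "Private room", "Shared room"],
    "price": [180, 70, 40, 220, 90, 60],
})

# Median nightly price per room type, highest first.
median_price = (
    toy.groupby("room_type")["price"]
    .median()
    .sort_values(ascending=False)
)
print(median_price)
```

The median is used here instead of the mean because a handful of very expensive listings would otherwise pull the average upward.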



### Geospatial Analysis with Folium
In this step, we will visualize Airbnb listings on an interactive map using Folium. This will help us understand the spatial distribution of listings and detect patterns by neighborhood and price.

In [None]:
import folium
from folium.plugins import MarkerCluster

# Create a base map centered around NYC
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Create a marker cluster
marker_cluster = MarkerCluster().add_to(nyc_map)

# Add markers to the cluster
for index, row in df.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=f"Price: ${row['price']}<br>Room Type: {row['room_type']}<br>Neighbourhood: {row['neighbourhood']}",
    ).add_to(marker_cluster)

# Display map
nyc_map


### Interactive Map with Marker Clusters
This map uses `folium.plugins.MarkerCluster` to group nearby Airbnb listings across New York City. When clicked, each marker displays the listing's price, room type, and neighborhood group (borough). Clustering prevents markers from overlapping and gives a cleaner view when zoomed out.
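
Popup text is passed to the map as an HTML string, so values taken from the dataset are interpolated into markup. One defensive option, shown here as a sketch, is to escape text fields before building the popup. The helper name `popup_html` and its parameters are hypothetical and not part of the notebook.

```python
from html import escape

def popup_html(price, room_type, area):
    """Build popup markup, escaping text fields so they cannot inject HTML."""
    return (
        f"<b>Price:</b> ${price}<br>"
        f"<b>Room Type:</b> {escape(str(room_type))}<br>"
        f"<b>Area:</b> {escape(str(area))}"
    )

# Example call with a value that contains HTML-significant characters.
print(popup_html(150, "Private room", "A <b> & B"))
```

The returned string can be passed to `folium.Popup` in place of the inline f-string.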



In [None]:
# Import necessary packages
import folium
from folium.plugins import MarkerCluster

# Create base map centered on New York City
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Create a Marker Cluster object
marker_cluster = MarkerCluster().add_to(nyc_map)

# Add listings to the marker cluster
for _, row in df.iterrows():
    folium.Marker(
        location=[row['latitude'], row['longitude']],
        popup=folium.Popup(
            # 'neighbourhood_group' holds the borough, so label it accordingly
            f"<b>Price:</b> ${row['price']}<br><b>Room Type:</b> {row['room_type']}<br><b>Neighborhood Group:</b> {row['neighbourhood_group']}",
            max_width=250
        ),
        icon=folium.Icon(color="blue", icon="home", prefix="fa")
    ).add_to(marker_cluster)

# Display the map
nyc_map


### Price Distribution with Circle Markers
This map visualizes Airbnb listings with `folium.CircleMarker`, color-coded by price range:

🟢 Green: Budget listings ($0–100)

🟠 Orange: Mid-range listings ($101–250)

🔴 Red: Premium listings ($251 and above)

It helps identify the geographical spread of various pricing tiers and spot pricing patterns across different boroughs.
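
The three tiers above can be written as a small helper, which keeps the boundaries in one place if several maps reuse them. This is a sketch; the function name `price_color` is hypothetical.

```python
def price_color(price):
    """Map a nightly price to the tier color described above."""
    if price <= 100:
        return "green"   # budget: $0-100
    if price <= 250:
        return "orange"  # mid-range: $101-250
    return "red"         # premium: $251 and above

# Check the values on either side of each boundary.
print([price_color(p) for p in (0, 100, 101, 250, 251)])
```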

In [None]:
# Create another base map
price_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Loop through the dataset and add circle markers based on price
for _, row in df.iterrows():
    price = row['price']
    color = 'green' if price <= 100 else 'orange' if price <= 250 else 'red'
    
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=3,
        fill=True,
        stroke=False,  # draw no outline; only the filled circle is shown
        fill_color=color,
        fill_opacity=0.5,
        popup=f"<b>${price}</b> - {row['room_type']}"
    ).add_to(price_map)

# Display the map
price_map


### Room Type Layers with LayerControl
This layered map separates listings by room type (Entire home/apt, Private room, and Shared room). Users can toggle each room type layer on or off to explore its spatial distribution. The map uses `folium.FeatureGroup` and `folium.LayerControl` for interactive filtering.

In [None]:
# Unique room types in dataset
room_types = df['room_type'].unique()

# Create base map
layer_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Add each room type as a separate layer
for room in room_types:
    fg = folium.FeatureGroup(name=room)
    for _, row in df[df['room_type'] == room].iterrows():
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=2,
            fill=True,
            fill_opacity=0.6,
            color='blue',
            popup=f"{room} - ${row['price']}"
        ).add_to(fg)
    fg.add_to(layer_map)

# Add layer control toggle
folium.LayerControl().add_to(layer_map)

# Display map
layer_map


### NYC Airbnb Listings Map

An interactive map showing NYC Airbnb listings clustered for clarity. Markers are color-coded by price:

- 🟢 Up to $100 (Budget)  
- 🟠 $101–$250 (Mid-range)  
- 🔴 Above $250 (Premium)  

Marker size (radius) reflects the number of reviews: more reviews mean larger markers.

Click a marker to see details like price, neighborhood, room type, minimum nights, and review count.
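
The size rule can be isolated as a clamp: divide the review count by a scale factor, then bound the result. The sketch below uses the same constants as the map code (scale 10, minimum 3, maximum 15); the function name `review_radius` and its keyword parameters are hypothetical.

```python
def review_radius(reviews, scale=10, min_radius=3, max_radius=15):
    """Scale the review count to a marker radius, clamped to [min_radius, max_radius]."""
    return min(max(reviews / scale, min_radius), max_radius)

# Sample a few review counts across the range.
for n in (0, 30, 50, 150, 400):
    print(n, review_radius(n))
```

With these constants, every listing with 30 or fewer reviews gets the minimum radius and every listing with 150 or more gets the maximum, so marker size only distinguishes listings between those two counts.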


In [None]:
import folium
from folium.plugins import MarkerCluster

# Create base map centered on NYC
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Create MarkerCluster object to group markers
marker_cluster = MarkerCluster().add_to(nyc_map)

for idx, row in df.iterrows():
    price = row['price']
    reviews = row['number_of_reviews']

    # Determine marker color based on price range
    if price <= 100:
        color = 'green'
    elif price <= 250:
        color = 'orange'
    else:
        color = 'red'

    # Set marker radius based on number of reviews (min 3, max 15)
    radius = min(max(reviews / 10, 3), 15)

    # Create popup text with listing details
    popup_text = (
        f"Price: ${price}<br>"
        f"Neighborhood: {row['neighbourhood']}<br>"
        f"Room type: {row['room_type']}<br>"
        f"Minimum nights: {row['minimum_nights']}<br>"
        f"Number of reviews: {reviews}"
    )

    # Add CircleMarker to the cluster
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=radius,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7,
        popup=folium.Popup(popup_text, max_width=300)
    ).add_to(marker_cluster)

# Display the map
nyc_map


### 🔴 Ultra Luxury Listings Analysis
This section highlights high-end Airbnb listings in New York City. Properties priced above $800 per night are considered ultra luxury.

The analysis includes:

- **Top Neighbourhoods:** A bar chart shows which neighbourhoods have the highest number of listings over $800 per night. This gives insights into where luxury accommodations are most concentrated.
- **Interactive Map:** These listings are visualized with a Folium map, where:
  - Each circle represents an ultra-luxury listing.
  - The color is fixed (dark red) to emphasize premium status.
  - The circle radius scales with the number of reviews to reflect popularity and guest engagement.
  - Popups display price, room type, and review count for each listing.

This analysis helps identify the areas of NYC where luxury listings are concentrated and shows how many reviews those listings have received.


In [None]:
import matplotlib.ticker as ticker

# Filter listings with price > $800
luxury_df = df[df['price'] > 800]

# Count of listings by neighbourhood
luxury_neighbourhoods = luxury_df['neighbourhood'].value_counts().head(10)

# Plotting
plt.figure(figsize=(10,6))
ax = sns.barplot(x=luxury_neighbourhoods.values, y=luxury_neighbourhoods.index, palette='flare')

# Set title and labels
plt.title("Top Neighbourhoods with Listings over $800")
plt.xlabel("Number of Listings")
plt.ylabel("Neighbourhood")

# Force x-axis ticks to show integers
ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

plt.show()


In [None]:
import folium

# Create a new map for ultra luxury listings
luxury_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)

# Filter listings with price > 800
luxury_df = df[df['price'] > 800]

# Add CircleMarkers for each luxury listing
for _, row in luxury_df.iterrows():
    radius = 3 + (row['number_of_reviews'] / 20)  # Scale by reviews
    popup_text = (
        f"Price: ${row['price']}<br>"
        f"Neighborhood: {row['neighbourhood']}<br>"
        f"Room type: {row['room_type']}<br>"
        f"Minimum nights: {row['minimum_nights']}<br>"
        f"Number of reviews: {row['number_of_reviews']}"
    )
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=radius,
        color='darkred',
        fill=True,
        fill_color='darkred',
        fill_opacity=0.7,
        popup=folium.Popup(popup_text, max_width=300)
    ).add_to(luxury_map)

# Display the map
luxury_map


### Conclusion
- The Airbnb dataset for NYC reveals key insights into listing distribution, pricing, and guest engagement.
- Listings are most heavily concentrated in popular neighborhoods such as Williamsburg and Bedford-Stuyvesant.
- The majority of listings are priced under $250 per night, with only a small fraction categorized as ultra-luxury (above $800).
- Entire home/apt listings are the most common room type, followed by Private rooms; Shared rooms are rare.
- The interactive map visualizations helped identify how pricing and review counts vary geographically.
- Ultra-luxury listings tend to cluster in specific neighborhoods; the map sizes each one by its review count, a rough indicator of guest engagement.
- Data cleaning decisions, such as leaving missing review fields unchanged and retaining the few extreme price outliers, kept the dataset complete for analysis.