<div style="background-color: #f4f6f8; border-left: 4px solid #2196f3; padding: 12px; margin: 10px 0; color: #1a1a1a;">
  <h3 style="color: #2196f3; margin: 0;">🔧 Feature Engineering</h3>
  <p style="margin-top: 8px;">
    In this step, we enhance our dataset by creating new features or modifying existing ones to uncover deeper insights during exploratory data analysis (EDA). The goal is not to prepare the data for modeling, but rather to:
  </p>
  <ul style="margin: 8px 0 0 20px; padding: 0; list-style-type: disc;">
    <li>Extract meaningful patterns.</li>
    <li>Simplify visualizations and comparisons.</li>
    <li>Improve interpretability of the data.</li>
  </ul>
  <p style="margin-top: 10px;">
    Examples of engineered features include:
  </p>
  <ul style="margin: 5px 0 0 20px; padding: 0; list-style-type: disc;">
    <li>Indicators for review activity (e.g., has_review, active_review_months).</li>
    <li>Normalized or binned pricing (e.g., price categories).</li>
    <li>Availability flags (e.g., available_more_than_half_year).</li>
    <li>Categorical groupings (e.g., region-wise or room_type clusters).</li>
  </ul>
  <p style="margin-top: 10px;">
    These derived features help us generate more insightful visualizations and draw meaningful comparisons across different dimensions of the Airbnb dataset.
  </p>
</div>
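As a rough sketch of two of the ideas listed above (binned pricing and a half-year availability flag), the snippet below uses a small made-up DataFrame. The bin edges, the labels, and the column names `price_category` and `available_more_than_half_year` are assumptions chosen for illustration, not values taken from this analysis.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "price": [40, 85, 120, 300],
    "availability_365": [0, 90, 200, 365],
})

# Hypothetical bin edges and labels; intervals are right-inclusive by default,
# so the bins are (0, 75], (75, 150], and (150, inf].
toy["price_category"] = pd.cut(
    toy["price"],
    bins=[0, 75, 150, float("inf")],
    labels=["Budget", "Mid-range", "Premium"],
)

# Flag listings available for more than half of the year (over 182 days).
toy["available_more_than_half_year"] = (toy["availability_365"] > 182).astype(int)

print(toy)
```

Sensible bin edges depend on the actual price distribution, so they would normally be chosen after inspecting it.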


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("AB_NYC_Cleaned.csv")

<div style="background-color: #fffbea; border-left: 4px solid #f9a825; padding: 12px; margin: 10px 0; color: #212121;">
  <h4 style="color: #f9a825; margin: 0;">🔹 Feature Name: <code>has_review</code></h4>
  <p style="margin: 8px 0;"><strong>Logic:</strong><br>
    If a listing has more than <code>0</code> reviews, it is marked as <code>1</code>; otherwise, it is marked as <code>0</code>.
  </p>
  <p style="margin: 8px 0;"><strong>Purpose:</strong><br>
    This feature helps distinguish between listings that have received feedback and those that haven’t. It is useful in understanding listing activity, popularity, and reliability.
  </p>
  <p style="margin: 8px 0;"><strong>Why Useful for Visualization:</strong></p>
  <ul style="margin: 0 0 0 20px; padding: 0; list-style-type: disc;">
    <li>Allows analysis of differences in price, room type, or availability between reviewed and non-reviewed listings.</li>
    <li>Helpful for filtering and grouping during exploratory data analysis.</li>
  </ul>
</div>


In [3]:
# Create a new binary feature: has_review (1 if the listing has any reviews, else 0)
df['has_review'] = (df['number_of_reviews'] > 0).astype(int)
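A minimal sketch of how the flag can be used for grouping during EDA, assuming a `price` column alongside `number_of_reviews`. The toy values below are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 12, 0],
    "price": [100, 80, 120, 200],
})

# Same rule as above, written as a vectorized comparison.
toy["has_review"] = (toy["number_of_reviews"] > 0).astype(int)

# Compare listing counts and median price between the two groups.
summary = toy.groupby("has_review")["price"].agg(["count", "median"])
print(summary)
```

The median is used here because listing prices tend to be skewed, but any aggregate could be swapped in.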

<div style="background-color:#e3f2fd; border-left:4px solid #2196f3; padding:14px; margin:12px 0; font-size:15px; line-height:1.6; color:#0a0a0a;">
  <h4 style="margin:0 0 6px 0; color:#0d47a1; font-size:17px;">🔹 Feature: <code>price_per_night</code></h4>
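  <p><strong>Logic:</strong> <code>price ÷ minimum_nights</code>, set to <code>NaN</code> when <code>minimum_nights</code> is not greater than 0.</p>
  <p><strong>Purpose:</strong> Expresses price relative to the minimum stay so that listings with different minimum-stay requirements can be compared on one scale.</p>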
  <p><strong>Visualization Use:</strong></p>
  <ul style="margin: 0 0 0 20px;">
    <li>Compare cost across <code>room_type</code> and <code>neighbourhood_group</code>.</li>
    <li>Spot overpriced/underpriced listings more clearly.</li>
  </ul>
</div>


In [4]:
import numpy as np

In [6]:
# Keep only positive minimum_nights as divisors; other rows become NaN
valid_nights = df['minimum_nights'].where(df['minimum_nights'] > 0)
df['price_per_night'] = df['price'] / valid_nights

# Replace inf with NaN safely
df['price_per_night'] = df['price_per_night'].replace([np.inf, -np.inf], np.nan)
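The sketch below illustrates, on made-up data, how the `NaN` guard behaves and how those missing values are skipped in a grouped comparison. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [100, 150, 90, 200],
    "minimum_nights": [1, 3, 0, 2],
})

# Rows where minimum_nights is not positive get NaN instead of a division result.
valid_nights = toy["minimum_nights"].where(toy["minimum_nights"] > 0)
toy["price_per_night"] = toy["price"] / valid_nights

# groupby aggregations skip NaN by default, so the guarded row does not distort the median.
median_by_room = toy.groupby("room_type")["price_per_night"].median()
print(median_by_room)
```

Because the guarded rows are dropped silently from aggregates, it can be worth counting them with `toy["price_per_night"].isna().sum()` before drawing comparisons.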


<div style="background-color:#e3f2fd; border-left:4px solid #2196f3; padding:14px; margin:12px 0; font-size:15px; line-height:1.6; color:#0a0a0a;">
  <h4 style="margin:0 0 6px 0; color:#0d47a1; font-size:17px;">🔹 Feature: <code>active_listing</code></h4>
  <p><strong>Logic:</strong> If <code>availability_365 &gt; 0</code>, then <code>active_listing = 1</code>; else <code>0</code>.</p>
  <p><strong>Purpose:</strong> Flags listings that have at least one available day according to <code>availability_365</code>.</p>
  <p><strong>Visualization Use:</strong></p>
  <ul style="margin: 0 0 0 20px;">
    <li>Filter for listings with active availability.</li>
    <li>Understand availability patterns by <code>room_type</code> or <code>neighbourhood_group</code>.</li>
    <li>Analyze market presence of hosts.</li>
  </ul>
</div>


In [7]:
df['active_listing'] = (df['availability_365'] > 0).astype(int)
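Since the flag is binary, its group mean equals the share of active listings in each group. The sketch below shows this on made-up data; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan"],
    "availability_365": [0, 120, 365, 10],
})

toy["active_listing"] = (toy["availability_365"] > 0).astype(int)

# The mean of a 0/1 flag is the fraction of rows where the flag is 1.
active_share = toy.groupby("neighbourhood_group")["active_listing"].mean()
print(active_share)
```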


<div style="background-color:#e3f2fd; border-left:4px solid #2196f3; padding:14px; margin:12px 0; font-size:15px; line-height:1.6; color:#0a0a0a;">
  <h4 style="margin:0 0 6px 0; color:#0d47a1; font-size:17px;">🔹 Feature: <code>seasonal_availability</code></h4>
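  <p><strong>Logic:</strong> Bin <code>availability_365</code> into four categories: <code>Not Available</code> (0 days), <code>Low</code> (1 to 120), <code>Medium</code> (121 to 240), and <code>High</code> (more than 240).</p>
  <p><strong>Purpose:</strong> Converts numeric availability into a small set of labels for easier comparison. The bins reflect the total number of available days per year rather than specific seasons.</p>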
  <p><strong>Visualization Use:</strong></p>
  <ul style="margin: 0 0 0 20px;">
    <li>Visualize listing distribution by availability category.</li>
    <li>Compare trends across <code>room_type</code> and <code>neighbourhood_group</code>.</li>
  </ul>
</div>


In [8]:
def categorize_availability(x):
    if x == 0:
        return 'Not Available'
    elif x <= 120:
        return 'Low'
    elif x <= 240:
        return 'Medium'
    else:
        return 'High'

df['seasonal_availability'] = df['availability_365'].apply(categorize_availability)
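As an alternative sketch, the same bins can be expressed with `pd.cut`, which returns an ordered categorical so that plots and group summaries follow the Not Available, Low, Medium, High order rather than alphabetical order. The toy values below are made up to exercise the bin edges.

```python
import pandas as pd

# Toy values chosen to sit on each bin boundary; for illustration only.
toy = pd.Series([0, 1, 120, 121, 240, 241, 365], name="availability_365")

labels = ["Not Available", "Low", "Medium", "High"]

# Right-inclusive bins: (-1, 0], (0, 120], (120, 240], (240, 365].
binned = pd.cut(toy, bins=[-1, 0, 120, 240, 365], labels=labels)
print(binned)
```

One difference from the function above: values outside the 0 to 365 range, or missing values, become `NaN` with `pd.cut`, whereas `categorize_availability` would label them `High`.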


<div style="background-color:#e3f2fd; border-left:4px solid #2196f3; padding:14px; margin:12px 0; font-size:15px; line-height:1.6; color:#0a0a0a;">
  <h4 style="margin:0 0 6px 0; color:#0d47a1; font-size:17px;">🔹 Feature: <code>is_entire_home</code></h4>
  <p><strong>Logic:</strong> Binary flag where 1 = <code>Entire home/apt</code>, 0 = otherwise.</p>
  <p><strong>Purpose:</strong> Separates listings that offer an entire home or apartment from private and shared rooms.</p>
  <p><strong>Visualization Use:</strong></p>
  <ul style="margin: 0 0 0 20px;">
    <li>Compare price or availability between entire homes and private or shared rooms.</li>
    <li>Useful for filtering or segmenting by privacy preference.</li>
  </ul>
</div>


In [9]:
df['is_entire_home'] = (df['room_type'] == 'Entire home/apt').astype(int)
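A quick way to sanity-check a derived flag like this is to cross-tabulate it against its source column. The sketch below does so on made-up data; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data for illustration only; these values are made up.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Entire home/apt"],
})

toy["is_entire_home"] = (toy["room_type"] == "Entire home/apt").astype(int)

# Each room_type should fall entirely under a single flag value.
check = pd.crosstab(toy["room_type"], toy["is_entire_home"])
print(check)
```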


In [10]:
df.to_csv("AB_NYC_Featured.csv", index=False)

<div style="background-color:#e8f5e9; border-left:6px solid #2e7d32; padding:16px; margin:20px 0; font-size:15.5px; line-height:1.7; color:#1b5e20;">
  <h3 style="margin-top:0; color:#1b5e20;">✅ <strong>Conclusion – Feature Engineering</strong></h3>
  <p>In this step, we engineered several new features to enhance our understanding of the Airbnb dataset and enable more insightful visualizations.</p>
  <ul style="margin-left:20px;">
    <li><code>has_review</code> separates listings that have received reviews from those that have not, and <code>active_listing</code> separates listings with available days from those with none.</li>
    <li><code>price_per_night</code> normalizes pricing by accounting for minimum stay requirements, making cost comparisons fairer.</li>
    <li><code>seasonal_availability</code> groups <code>availability_365</code> into availability bands (Not Available, Low, Medium, High) for easier comparison.</li>
    <li>The binary feature <code>is_entire_home</code> helps segment listings by type for better comparative analysis.</li>
  </ul>
  <p>These engineered features not only add depth to our exploratory data analysis (EDA) but also lay the foundation for more targeted and meaningful visualizations in the next steps.</p>
</div>
