# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -2210992525** - **Vikas**
##### **Team Member 2 -2210992526**  - **Vikram Monga**
##### **Team Member 3 -2210992529**  - **Vinay Sharma**
##### **Team Member 4 -2210992531**  - **Vinit**

# **Project Summary -**


Introducing our innovative house price prediction model: a sophisticated yet user-friendly tool designed to provide actionable insights into the complex world of real estate valuation. Built upon a foundation of meticulous data collection and preprocessing, our model aggregates diverse datasets encompassing a multitude of factors influencing property prices. From fundamental attributes like location and size to nuanced considerations such as neighborhood demographics, amenities, and economic indicators, we leave no stone unturned in capturing the intricacies of property valuation.

At the heart of our model lies advanced feature engineering techniques, where we extract meaningful insights from the raw data and craft new variables to enrich predictive accuracy. By encoding factors like proximity to essential amenities, neighborhood safety ratings, school district quality, and transportation accessibility, we ensure that our model captures the holistic landscape of property valuation with precision.

Through the application of state-of-the-art machine learning algorithms, our model undergoes rigorous training and optimization to learn from the data and discern complex patterns. Leveraging supervised learning techniques such as regression and ensemble methods, we map the relationship between the input features and the target variable (price), refining the model's predictive capabilities with each training iteration.

Validation and evaluation are paramount in ensuring the robustness and generalization of our model. We employ rigorous techniques such as k-fold cross-validation and performance metrics like Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to scrutinize our model's predictive performance across diverse datasets. By simulating real-world scenarios and stress-testing against unseen data, we validate the reliability and scalability of our model, instilling confidence in its predictive prowess.

The integration and deployment of our model are seamless and intuitive, designed to empower stakeholders with accessible and actionable insights. Whether through web applications, mobile interfaces, or API integrations, our model provides user-friendly platforms for accessing predictive insights effortlessly. Real-time updates and scalability ensure that our model remains adaptable to evolving market dynamics, sustaining its relevance and utility over time.

Our model not only delivers predictive insights but also prioritizes interpretability, enabling users to understand the underlying factors driving property valuations. Through interactive visualizations and transparent model explanations, stakeholders gain deeper comprehension of market trends and investment opportunities. From identifying undervalued properties to mitigating risk, our model equips users with the foresight needed to make informed decisions with confidence.

In conclusion, our house price prediction model stands as a testament to innovation in real estate analytics. By leveraging advanced algorithms, comprehensive datasets, and intuitive interfaces, we empower stakeholders to navigate the dynamic landscape of property investments effectively. Whether you're a seasoned investor or a first-time homebuyer, our model is your trusted companion in unlocking the full potential of the real estate market.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Build a regression model to predict house prices based on features such as location, size, number of bedrooms, etc. This project involves data preprocessing, feature engineering, and evaluating different regression algorithms**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('Housing.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()

duplicate_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = df.isnull().sum().reset_index()
null_values

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique().reset_index()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.sample(20)

### What all manipulations have you done and insights you found?

The data contains no null values and no duplicate rows.
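
The check behind this statement can be made explicit. The sketch below is only an illustration: `sample` is a small invented frame (not the project's `Housing.csv`) and `quality_report` is a hypothetical helper name, but the pandas calls are the same ones used earlier in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; values are invented for illustration.
sample = pd.DataFrame({
    "price": [4200000, 3500000, 4200000, 5250000],
    "area": [3000, 2400, 3000, 4100],
    "furnishingstatus": ["furnished", "unfurnished", "furnished", None],
})

def quality_report(frame: pd.DataFrame) -> dict:
    """Count fully duplicated rows and missing cells in a DataFrame."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
    }

report = quality_report(sample)
print(report)
```

Running the same helper on the project's `df` should report zero for both counts if the statement above holds, which is what would justify skipping de-duplication and imputation.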

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Histogram of house prices: bar height is the number of houses in each price bin
plt.hist(df['price'], bins=20)
plt.title('Distribution of Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively shows the spread and frequency of price values.

##### 2. What is/are the insight(s) found from the chart?

Popular price ranges: Guide pricing and marketing strategies.
Price variation: Understand market consistency for informed decisions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Data-driven decisions on pricing, marketing, and inventory.
Negative : Overemphasizing price, misinterpreting outliers.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.scatter(df['area'], df['price'])
plt.title('Relation between Area and Price')
plt.xlabel('Area')
plt.ylabel('Price')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False)
plt.show()


##### 1. Why did you pick the specific chart?

Shows the relationship between area and price, allowing for a visual exploration of potential correlations (positive, negative, or none) between the two variables.

##### 2. What is/are the insight(s) found from the chart?

Look for trends: Positive correlation (larger area, higher price), negative correlation (larger area, lower price), or no clear correlation.
Identify outliers (data points far from the main cluster) for further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Inform pricing strategies based on area (e.g., premium for larger spaces, adjusting prices for different area ranges).
Negative : Overemphasis on area alone could lead to neglecting other factors (location, amenities), and misinterpreting correlation as causation.
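
The visual impression from the scatter plot can also be put into numbers. The following is a minimal sketch on invented `(area, price)` pairs (not the project's data) that lie exactly on a straight line; it computes the Pearson correlation and a degree-1 least-squares fit with NumPy.

```python
import numpy as np

# Invented (area, price) pairs that lie exactly on price = 1000 * area + 500000.
area = np.array([2000.0, 3000.0, 4000.0, 5000.0])
price = 1000.0 * area + 500000.0

# Pearson correlation: +1 / -1 for a perfect linear trend, near 0 for none.
r = np.corrcoef(area, price)[0, 1]

# Degree-1 least-squares fit: the slope is the price change per unit of area.
slope, intercept = np.polyfit(area, price, deg=1)

print(f"r = {r:.3f}, slope = {slope:.1f}, intercept = {intercept:.1f}")
```

On real data the points will scatter around the line, so `r` will fall below 1 and the slope should be read as an average association, not a causal effect.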

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Average price for houses with and without air conditioning
avg_price_by_airconditioning = df.groupby('airconditioning')['price'].mean()
avg_price_by_airconditioning.plot(kind='bar')
plt.title('Average Price by airconditioning')
plt.xlabel('airconditioning')
plt.ylabel('Average Price')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False, axis='y')

plt.show()

##### 1. Why did you pick the specific chart?

Compares average price across airconditioning categories (yes/no), showing relative differences and which category has a higher/lower average price.

##### 2. What is/are the insight(s) found from the chart?

Look for differences: Higher average price with air conditioning might indicate its value; lower average price without it could be due to various factors.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Inform property selection and pricing strategies based on air conditioning presence/absence and market trends.
Negative: Overemphasis on air conditioning can lead to neglecting other factors like location and amenities.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Average price by number of parking spaces
avg_price_by_parking = df.groupby('parking')['price'].mean()
avg_price_by_parking.plot(kind='bar')
plt.title('Average Price by parking')
plt.xlabel('parking')
plt.ylabel('Average Price')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False, axis='y')

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is chosen here because it effectively compares the average price across different categories of parking (number of parking spots). It helps to visualize:
Relative differences in average price for properties with varying numbers of parking spaces.
Which category has a higher or lower average price.

##### 2. What is/are the insight(s) found from the chart?

Impact of parking availability on price:
Higher average price for properties with more parking spaces might indicate its value for buyers.
Lower average price for properties with fewer parking spaces could be due to various factors (e.g., location, lack of demand).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Informed property selection: Understanding how parking availability affects average price can help businesses:
Evaluate investment opportunities considering the number of parking spaces available.
Develop pricing strategies for properties with different parking options based on market trends.
Negative Impact:

Oversimplification of price factors: Number of parking spaces is just one factor influencing price. Overemphasis on its availability could lead to neglecting other important aspects like location, amenities, and market conditions.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Relation between price, bedrooms and bathrooms
sns.scatterplot(data=df, x="bedrooms", y="bathrooms", hue="price")
plt.title('Relation between Price, Bedrooms and Bathrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Bathrooms')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot allows for exploring potential relationships between three variables: bedrooms, bathrooms, and price. Bedrooms and bathrooms are placed on the axes, and color-coding by price adds the third dimension.

##### 2. What is/are the insight(s) found from the chart?

Relationship between factors: Look for potential trends or clusters:
Price increasing as both bedrooms and bathrooms increase.
Limited impact of bedrooms on price if bathrooms remain low.
Price changing based on specific combinations of bedrooms and bathrooms.
Identifying outliers: Data points with unusual combinations or extreme values might require further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Informed pricing strategies: Understanding the relationship between price, bedrooms, and bathrooms can help businesses:
Set competitive prices considering the combination of these features.
Identify properties with potential for higher pricing based on favorable combinations of bedrooms and bathrooms.
Negative:

Oversimplification of price factors: These three factors are just a subset of influences on price. Overemphasizing them could lead to neglecting other important aspects like location, amenities, and market conditions.
Misinterpreting correlation as causation: Observed relationships don't necessarily imply cause and effect. For example, a higher price associated with more bedrooms and bathrooms might simply reflect larger properties, not a direct causal relationship.
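
Because bedrooms and bathrooms take only a few integer values, many points overlap in the scatter plot. One way to read "price by combination" more directly is a pivot table of mean prices. This is a sketch on invented rows (not taken from `Housing.csv`); only the column names match the notebook.

```python
import pandas as pd

# Invented rows for illustration; not taken from Housing.csv.
sample = pd.DataFrame({
    "bedrooms":  [2, 2, 3, 3, 3, 4],
    "bathrooms": [1, 1, 1, 2, 2, 2],
    "price":     [3000000, 3400000, 4000000, 5000000, 5600000, 7000000],
})

# Rows: bedrooms, columns: bathrooms, cells: mean price of that combination.
# Combinations with no houses show up as NaN.
mean_price = sample.pivot_table(index="bedrooms", columns="bathrooms",
                                values="price", aggfunc="mean")
print(mean_price)
```

Passing the resulting table to `sns.heatmap` would be one option for turning it into a chart, if a visual form is preferred.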

#### Chart - 6

In [None]:
df.describe()

In [None]:
# Chart - 6 visualization code
# Number of houses for each furnishingstatus
df['furnishingstatus'].value_counts().plot(kind='bar')
plt.title('furnishingstatus of houses')
plt.xlabel('furnishingstatus')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is suitable because it effectively compares the frequency (count) of each unique value in the furnishingstatus column. It helps visualize:
Number of houses belonging to each furnishing category (furnished, semi-furnished, unfurnished).

##### 2. What is/are the insight(s) found from the chart?

Look for the dominant furnishing type: The bar with the highest value indicates the most common furnishing status among the houses.
Observe the distribution across categories: Compare the heights of the bars to understand the relative prevalence of each furnishing type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Target marketing efforts: Understanding the most common furnishing status can help tailor marketing messages and strategies towards specific customer segments based on their preferences.
Inventory management: Insights may help adjust inventory strategy, potentially focusing on acquiring properties that cater to the most in-demand furnishing type.
Negative:

Overgeneralization: Assuming a single furnishing type is universally preferred can lead to neglecting the needs of customers seeking other options.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# prices according to bedrooms
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x='bedrooms', y='price', hue='bedrooms', data=df, palette='tab10')
plt.title('Box Plot of Prices based on Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.legend(title='Bedrooms')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False, axis='y')

plt.show()

##### 1. Why did you pick the specific chart?

A box plot allows for comparing the distribution of price across different categories of bedrooms. The color-coding adds another layer:
Visually distinguishing the distribution of prices for different numbers of bedrooms (e.g., higher number of bedrooms might generally have a different price range).
Identifying outliers (data points beyond the whiskers) for each bedroom category.

##### 2. What is/are the insight(s) found from the chart?

Price distribution for each bedroom category:
Observe the spread of prices within each box (represented by the box's height) for different bedroom counts.
Compare the median prices (indicated by the horizontal lines within the boxes) across different bedroom categories.
Outliers: Data points beyond the whiskers in each color group might be outliers and warrant further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Informed pricing strategies: Understanding the distribution of prices for different bedroom categories can help businesses:
Set competitive prices based on the number of bedrooms.
Identify potential price adjustments based on observed variations within each category.
Negative :

Overreliance on bedrooms alone: Relying solely on the number of bedrooms for pricing can neglect other important factors like location, amenities, and overall property condition.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Number of houses by airconditioning and parking, stacked by furnishingstatus
df.groupby(['airconditioning','parking','furnishingstatus']).size().unstack().plot(kind='bar', stacked=True)
plt.title('Count of Houses by Air Conditioning, Parking, and Furnishing Status')
plt.xlabel('(Air Conditioning, Parking)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart is suitable because it effectively compares the number of houses across different combinations of three categorical variables: airconditioning, parking, and furnishingstatus. The stacking allows visualization of:
Total count of houses for each unique combination (sum of individual bars in a stacked group).
Distribution of counts within each combination category (relative heights of bars within a stack).

##### 2. What is/are the insight(s) found from the chart?

Look for combinations with a high number of houses: Identify which combinations of air conditioning, parking, and furnishing status have the most listings.
Compare relative distribution within each category: Analyze how the counts are distributed across different options for each individual variable (e.g., furnished vs. unfurnished within each air conditioning and parking combination).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Informed inventory selection: Understanding which combinations are most prevalent can help businesses:
Focus on acquiring properties that align with customer preferences based on these features.
Adjust pricing strategies based on the relative demand for specific combinations.
Negative :

Oversimplification of market preferences: Attributing high overall count solely to individual features can be misleading. It's essential to consider all three factors in combination and their potential interactions.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# price density graph
import seaborn as sns
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df['price'], color="blue", fill=True)
plt.title('KDE Plot of Price')
plt.xlabel('Price')
plt.ylabel('Density')
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False, axis='both')
plt.xticks(rotation='vertical')

plt.show()

##### 1. Why did you pick the specific chart?

A KDE plot is suitable because it effectively shows the density distribution of a continuous variable, in this case, price. It helps visualize:
Overall shape and spread of price data (represented by the smooth curve).
Areas with higher concentrations of prices (indicated by higher peaks on the curve).

##### 2. What is/are the insight(s) found from the chart?

Observe the general shape of the curve:
If symmetrical, it suggests a relatively balanced distribution of prices around a central value.
If skewed, it indicates a bias towards either higher or lower prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Informed pricing strategies: Understanding the density distribution of prices can help businesses:
Identify price ranges with higher competition.
Set competitive prices considering the overall price distribution and specific density peaks.
Negative :

Misinterpreting density as prevalence: Higher density doesn't necessarily equate to absolute prevalence. It represents the relative concentration of data points within a specific range.
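
Whether the curve is symmetrical or skewed can be checked numerically as well as visually. The sketch below uses an invented price series (in millions, not the project's data) and pandas' built-in sample skewness; a positive value means a longer right tail, which also pulls the mean above the median.

```python
import pandas as pd

# Invented prices with one expensive house that stretches the right tail.
prices = pd.Series([3.0, 3.5, 4.0, 4.5, 5.0, 12.0], name="price_millions")

skewness = prices.skew()       # > 0: long right tail, < 0: long left tail
mean, median = prices.mean(), prices.median()

print(f"skew = {skewness:.2f}, mean = {mean:.2f}, median = {median:.2f}")
```

The same two lines applied to `df['price']` would give a quick numeric companion to the KDE plot.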

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Distribution of the number of stories
import seaborn as sns
# violin plot of the stories column
sns.violinplot(data=df, x="stories")
plt.title("Violin Plot of stories")
plt.xlabel("stories")
plt.ylabel("Density")
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot is a suitable choice because it effectively shows both the distribution and density of the stories variable. It provides insights into:
Spread of data: The overall range of the number of stories across the properties.
Central tendency: The median number of stories (indicated by the median marker in the inner box plot of the violin).
Density at different values: The wider the violin at a specific story count, the higher the concentration of houses with that number of stories.

##### 2. What is/are the insight(s) found from the chart?

Observe the spread of the violin: A wider violin indicates a larger spread of the number of stories across the houses, while a narrower violin suggests a more concentrated distribution.
Look for the widest section of the violin: This area represents the number of stories with the highest concentration of houses.
Identify the position of the median marker: It indicates the median number of stories, providing a sense of the central tendency in the data.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Informed property selection: Understanding the distribution of stories can help businesses:
Target specific property types based on the desired number of stories (e.g., focusing on single-story properties for families seeking easier accessibility).
Adjust pricing strategies considering the number of stories, as it might influence property value in some markets.
Negative :

Overemphasis on individual values: Focusing solely on the most common number of stories can overlook other important factors affecting price and desirability.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the average price for each number of bedrooms
avg_price_by_bedrooms = df.groupby('bedrooms')['price'].mean()

# Create line plot
plt.figure(figsize=(10, 6))
sns.lineplot(x=avg_price_by_bedrooms.index, y=avg_price_by_bedrooms.values, marker='o', color='blue')
plt.title('Average Price by Number of Bedrooms')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Average Price')
plt.grid(True)
ax = plt.gca()
ax.ticklabel_format(style='plain', useOffset=False, axis='y')

plt.show()

##### 1. Why did you pick the specific chart?

A line plot is appropriate here because it effectively visualizes the trend in average price as the number of bedrooms increases. It helps identify any linear relationship or patterns between these variables.

##### 2. What is/are the insight(s) found from the chart?

Observe the trend: See if the average price increases, decreases, or remains stable with more bedrooms. This can inform pricing strategies for different property types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact: Knowing the price trend allows businesses to set competitive prices based on bedroom configuration, potentially leading to faster sales or rentals. Negative impact: Relying solely on this trend might neglect other factors influencing price (e.g., location, property condition). Businesses should consider a holistic approach when setting prices.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
fig, axes = plt.subplots(2, 2, figsize=(10, 10))

# scatter plot of the bedrooms and price
sns.scatterplot(data=df, x="bedrooms", y="price", ax=axes[0, 0])

# box plot of price for each number of bathrooms
sns.boxplot(data=df, x="bathrooms", y="price", ax=axes[0, 1])

# histogram of the area column
sns.histplot(data=df, x="area", ax=axes[1, 0])

# pie chart of the furnishingstatus column
df['furnishingstatus'].value_counts().plot(kind='pie', ax=axes[1, 1])
plt.show()

##### 1. Why did you pick the specific chart?

*   Scatter plot: explores the potential correlation between bedrooms and price.
*   Box plot: compares the distribution of price across different bathroom categories.
*   Histogram: shows the frequency distribution of property area.
*   Pie chart: illustrates the proportions of houses in each furnishing category (furnished, semi-furnished, unfurnished).


##### 2. What is/are the insight(s) found from the chart?

*   Scatter plot: look for trends, such as higher prices with more bedrooms or no clear correlation.
*   Box plot: observe the spread (box height), the median price (horizontal line), and potential outliers (points beyond the whiskers).
*   Histogram: identify the spread of area values and the most common ranges.
*   Pie chart: see the dominant furnishing type and the relative distribution across categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Provides a comprehensive overview of various aspects for informed decision-making.
Combine insights from each visualization to understand how different factors interact and might influence property value, selection, and pricing strategies.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# heatmap of area bedrooms price and bathrooms
import seaborn as sns
sns.heatmap(df[['area', 'bedrooms', 'bathrooms', 'price']].corr(), annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is appropriate here because it visually represents the correlation coefficients between all pairs of the selected numeric variables: area, bedrooms, bathrooms, and price. The value and color of each square indicate the strength and direction of the relationship.

##### 2. What is/are the insight(s) found from the chart?

Look for squares with values far from zero: These indicate potentially strong positive or negative correlations between the variables.
Compare the colors: With the default colormap, lighter squares correspond to higher correlation values.
Pay attention to the values within the squares: These numbers quantify the correlation coefficient, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive:

Identify potential relationships: Understanding correlations can provide valuable insights for businesses:
Pricing strategies: Consider the correlation between features (e.g., higher price for larger areas or more bedrooms) to inform pricing decisions.
Marketing strategies: Depending on the correlations, specific features can be emphasized in marketing materials to target potential buyers' preferences.
Negative :

Misinterpreting correlation as causation: Remember, correlation doesn't necessarily imply causation. A correlation between two variables might be influenced by other factors.
Overreliance on correlations: Use correlation insights alongside other data and analysis for comprehensive decision-making.
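
To read the heatmap as a ranking, the correlation of each feature with price can be sorted by absolute value. This is a sketch on invented numeric columns (not the project's data); the column names mirror the notebook, and `ranked` is just an illustrative variable name.

```python
import pandas as pd

# Invented numeric columns for illustration; not taken from Housing.csv.
sample = pd.DataFrame({
    "price":    [30, 42, 50, 61, 70, 85],
    "area":     [20, 30, 35, 45, 50, 62],
    "bedrooms": [2, 3, 2, 4, 3, 4],
    "parking":  [1, 0, 2, 0, 1, 1],
})

# Correlation of every feature with price, strongest absolute value first.
corr_with_price = sample.corr(numeric_only=True)["price"].drop("price")
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.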

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.corr(numeric_only=True)

# Plotting the heatmap
plt.figure(figsize=(10, 8))
# center=0 keeps red for positive and blue for negative correlations
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, linewidths=.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap summarizes the pairwise correlations between all numerical variables in the dataset in a single view.
The Seaborn heatmap uses color to make strong relationships easy to spot.

##### 2. What is/are the insight(s) found from the chart?

To explore potential relationships between variables by:
Highlighting patterns using color intensity and values within squares.
Helping identify positive (red), negative (blue), or weak correlations.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='furnishingstatus', palette='bright', markers=["D", "o", "s"])
plt.suptitle('Scatter Plot Matrix', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A Pair Plot is chosen because it effectively visualizes relationships between all pairs of numerical variables in the data. It combines:
Scatter plots: To show potential correlations between continuous variables.
Histograms/density plots: To visualize the distribution of each individual variable.
Additionally, coloring by furnishingstatus allows observing how this categorical variable might influence the relationships between other numerical variables.

##### 2. What is/are the insight(s) found from the chart?

Look for potential trends in the scatter plots:
Do certain variables seem to increase or decrease together (positive correlation)?
Do they seem unrelated (no clear correlation)?
Observe the diagonal plots:
Identify the spread and shape of each variable's distribution.
Analyze how the color coding by furnishingstatus affects the scatter plots:
Do trends appear differently for different furnishing categories?

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.
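
As one illustration of the kind of test this section asks for, the sketch below compares mean prices of two groups with a permutation test. Everything here is an assumption made for the example: the prices are invented, the grouping by air conditioning is only a candidate statement, and the permutation test is swapped in as an assumption-light option; a t-test or ANOVA may suit the project better. `permutation_p_value` is a hypothetical helper name.

```python
import numpy as np

def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between a and b."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Invented prices (in millions) for houses with and without air conditioning.
with_ac = [6.1, 5.8, 7.2, 6.5, 6.9, 7.4, 6.0, 6.6]
without_ac = [4.2, 4.9, 3.8, 4.5, 5.1, 4.0, 4.4, 4.7]

p_value = permutation_p_value(with_ac, without_ac)
print(f"p-value = {p_value:.4f}")
```

Here the null hypothesis is that both groups share the same mean price; a p-value below the chosen significance level (commonly 0.05) would lead to rejecting it.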

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum().reset_index()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.
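
If the check above reports missing values, a simple imputation could look like the sketch below. The `demo` frame is synthetic and the choice of median and mode is an assumption for illustration, not a record of what was done in this project.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with gaps (illustrative values only)
demo = pd.DataFrame({
    "area": [3000.0, np.nan, 4500.0, 6000.0],
    "furnishingstatus": ["furnished", None, "unfurnished", "furnished"],
})

# Count missing values per column before imputing
missing_before = demo.isnull().sum()

# Numeric column: the median is robust to extreme values
demo["area"] = demo["area"].fillna(demo["area"].median())

# Categorical column: use the most frequent category
most_common = demo["furnishingstatus"].mode()[0]
demo["furnishingstatus"] = demo["furnishingstatus"].fillna(most_common)

print(missing_before)
print(demo)
```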

### 2. Handling Outliers

In [None]:
df.head()

In [None]:
# Handling Outliers & Outlier treatments
sns.boxplot(data = df,x='bedrooms')

In [None]:
# Handling Outliers & Outlier treatments
import numpy as np
import pandas as pd

# Select the columns that already have a numeric dtype
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Calculate quartiles and IQR
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1

# Calculate lower and upper bounds for outliers
outlier1 = Q1 - 1.5 * IQR
outlier2 = Q3 + 1.5 * IQR

# Count the outliers in each feature
outliers = ((df[numeric_cols] < outlier1) | (df[numeric_cols] > outlier2)).sum()

# Print the count of outliers in each feature
print("Number of outliers:")
print(outliers)

# Handling outliers (choose a method that suits the dataset and requirements)
# Here values beyond the IQR bounds are clipped to the nearest bound;
# df_clipped is a preview copy and df itself is left unchanged
df_clipped = df[numeric_cols].clip(outlier1, outlier2, axis=1)

In [None]:
df.head()

In [None]:
# Cap outliers at the IQR bounds: Q1 - 1.5*IQR (lower) and Q3 + 1.5*IQR (upper)
for i in ['price','area','bedrooms','bathrooms','stories','parking']:
  Q1 = df[i].quantile(0.25)
  Q3 = df[i].quantile(0.75)
  IQR = Q3 - Q1
  df[i] = df[i].clip(lower=Q1 - 1.5*IQR, upper=Q3 + 1.5*IQR)

In [None]:
# Recompute the IQR bounds and count the values that still fall outside them
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
outlier1  = Q1 - 1.5*IQR
outlier2  = Q3 + 1.5*IQR

((df[numeric_cols] < outlier1) | (df[numeric_cols] > outlier2)).sum()

In [None]:
# Boxplot of bedrooms after outlier treatment
sns.boxplot(data = df,x='bedrooms')

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
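
The capping used in the cells above can be summarized as a small reusable helper. This is a sketch; the function name `cap_outliers_iqr` and the sample values are made up for illustration.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative bedroom counts with one extreme value
bedrooms = pd.Series([2, 3, 3, 3, 4, 4, 10])
capped = cap_outliers_iqr(bedrooms)
print(capped.tolist())
```

Here Q1 is 3 and Q3 is 4, so the bounds are 1.5 and 5.5, and only the value 10 is pulled in to 5.5. Capping keeps every row, which matters for a small dataset, whereas dropping rows would shrink the training set.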

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
x = df.drop(columns = 'price')
y = df['price']
x = pd.get_dummies(x)
# Reassign instead of using inplace=True so the fill does not act on a view of df
x = x.fillna(0)
y = y.fillna(0)
x.head(2)

In [None]:
# Map 'yes' and 'no' to 1 and 0 for binary columns
binary_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']
df[binary_cols] = df[binary_cols].apply(lambda col: col.map({'yes': 1, 'no': 0}))

In [None]:
# Ordinal encoding of furnishing level: unfurnished < semi-furnished < furnished
tri_cols = ['furnishingstatus']
df[tri_cols] = df[tri_cols].apply(lambda col: col.map({'unfurnished': 0, 'semi-furnished': 1, 'furnished': 2}))

In [None]:
df.head(20)

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
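
The cells above use two styles of encoding: a 0/1 mapping for yes/no columns and an integer mapping for furnishingstatus. As a point of comparison, the sketch below shows the 0/1 mapping next to one-hot encoding, which avoids imposing an order on the categories. The `demo` frame is synthetic.

```python
import pandas as pd

demo = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Binary yes/no column: a single 0/1 indicator carries all the information
demo["mainroad"] = demo["mainroad"].map({"yes": 1, "no": 0})

# Multi-level column: one indicator per category. drop_first=True drops
# one level so the indicators are not perfectly collinear in linear models
encoded = pd.get_dummies(
    demo, columns=["furnishingstatus"], drop_first=True, dtype=int
)
print(encoded)
```

The integer mapping is reasonable when the levels have a natural order, as furnishing level arguably does; one-hot encoding is the safer default when they do not.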

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Only the imports needed for this regression workflow are kept
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
df = pd.get_dummies(df)
print(df.columns)

In [None]:
print(df.head())

In [None]:
# # Drop 'furnishingstatus' column for now to avoid issues with one-hot encoding
# df.drop('furnishingstatus', axis=1, inplace=True)

In [None]:
# Standardize every column to zero mean and unit variance (this includes the target 'price')
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)

In [None]:
print(df_scaled.head())

#### 2. Feature Selection

In [None]:
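# Select your features wisely to avoid overfitting
# Univariate F-test between each feature and the target

from sklearn.feature_selection import f_regression
f_stats, p_values = f_regression(x, y)
# A small p-value suggests a linear association between the feature and price
pd.Series(p_values, index=x.columns).sort_values()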

##### What all feature selection methods have you used  and why?

Answer Here.
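
One way to turn the F-test p-values into an actual selection step is scikit-learn's `SelectKBest`. The sketch below uses synthetic data in which one column is pure noise by construction; the column names and `k=2` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "area": rng.normal(5000, 1500, n),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "noise": rng.normal(0, 1, n),  # unrelated to the target by construction
})
y_demo = (
    800 * X_demo["area"]
    + 900_000 * X_demo["bathrooms"]
    + rng.normal(0, 500_000, n)
)

# Keep the k features with the highest univariate F-statistic
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])
p_values_demo = pd.Series(selector.pvalues_, index=X_demo.columns)

print(selected)
print(p_values_demo)
```

The F-test only captures linear, one-feature-at-a-time relationships, so it is best treated as a first filter rather than a final verdict on feature importance.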

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler

# Select the numerical columns to scale
# Note: 'price' is scaled too, so errors computed on it later are in scaled units
numerical_cols = ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking','furnishingstatus']

# Initialize the scaler
scaler = MinMaxScaler()

# Fit and transform the numerical columns
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

print(df.head())

##### Which method have you used to scale your data and why?
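
Min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below applies it to a made-up price column and then inverts it, which is useful when predictions made on a scaled target need to be reported in the original units. The name `price_scaler` and the values are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative prices as a single-column 2D array
price = np.array([[2_000_000.0], [4_000_000.0], [6_000_000.0], [10_000_000.0]])

price_scaler = MinMaxScaler()                 # maps the column to [0, 1]
scaled = price_scaler.fit_transform(price)    # (x - min) / (max - min)
restored = price_scaler.inverse_transform(scaled)  # back to original units

print(scaled.ravel())
print(restored.ravel())
```

Keeping a dedicated scaler object for the target makes the inverse step straightforward; a scaler fitted on several columns at once can only invert all of them together.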

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
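# Split your data to train and test. Choose Splitting ratio wisely.
# Split first (80% train, 20% test) so the scaler never sees the test rows
X = x
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)

In [None]:
# Fit the scaler on the training rows only, then apply the same transform to the test rows
scalar = StandardScaler()
scalar.fit(x_train)
x_train = scalar.transform(x_train)
x_test = scalar.transform(x_test)
x_train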

In [None]:
y_train

In [None]:
x_train.shape

In [None]:
x_test.shape

In [None]:
y_train.shape

In [None]:
y_test.shape

##### What data splitting ratio have you used and why?

Answer Here.
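
A hedged sketch of an 80/20 split combined with cross-validation on the training part is given below. It uses synthetic data and wraps the scaler and the model in a pipeline, so that scaling statistics are always learned from training rows only. All names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 80% of the rows for training, 20% held out for the final check
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2
)

# The pipeline refits the scaler inside every CV fold,
# using that fold's training portion only
model = make_pipeline(StandardScaler(), LinearRegression())
cv_r2 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")

model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(cv_r2.mean(), test_r2)
```

An 80/20 split is a common default for datasets of a few hundred rows: it leaves enough data to fit the model while keeping a test set large enough to give a stable estimate.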

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LinearRegression
# ML Model - 1 Implementation
reg = LinearRegression()
# Fit the Algorithm
reg.fit(x_train,y_train)
# Predict on the model
y_pred = reg.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Linear_MAE = mean_absolute_error(y_test,y_pred)
print("MAE :" ,Linear_MAE)
Linear_MSE = mean_squared_error(y_test,y_pred)
print("MSE :" ,Linear_MSE)
Linear_RMSE = np.sqrt(Linear_MSE)
print("RMSE :" ,Linear_RMSE)
Linear_r2 = r2_score(y_test,y_pred)
print("r2 :" ,Linear_r2)
# Adjusted R-squared formula = 1 - ( (1-R^2) * (n-1) / (n-p-1) )
n, p = x_test.shape  # n test samples, p features
Linear_adjusted_r2 = 1 - (1 - Linear_r2) * (n - 1) / (n - p - 1)
print("Adjusted_r2 :" ,Linear_adjusted_r2)
Linear_Dataframe = pd.DataFrame(zip(y_test, y_pred), columns = ['actual', 'pred'])
Linear_Dataframe
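
To make the metric definitions concrete, here is a small NumPy-only sketch that computes the same quantities by hand. The helper name `regression_report` and the four sample values are made up for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 as a dictionary."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)

    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1 - ss_res / ss_tot
    # Penalize R2 for the number of features used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "Adjusted_R2": adj_r2}

report = regression_report([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0], n_features=1)
print(report)
```

For these values the residuals are 1, 0, -1, 0, which gives MAE = 0.5, MSE = 0.5, R2 = 0.9 and adjusted R2 = 0.85.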

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
lasso = Lasso(max_iter = 3000)
parameters = {'alpha' : [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,110,120,150,160,170,200]}

# Fit the Algorithm
lasso_regressor = GridSearchCV(lasso, parameters,scoring='neg_mean_squared_error' ,cv = 5)
lasso_regressor.fit(x_train, y_train)
print("Best alpha :", lasso_regressor.best_params_['alpha'])
# Predict on the model, using the estimator refit with the best alpha found by the search
lasso = lasso_regressor.best_estimator_
y_pred_lasso = lasso.predict(x_test)


In [None]:
# score() returns R-squared for regressors, not classification accuracy
r2_percent = reg.score(x_test, y_test) * 100
print(f"R-squared: {r2_percent:.2f}%")

##### Which hyperparameter optimization technique have you used and why?

Answer Here.
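
As a compact illustration of grid search with cross-validation, the sketch below tunes the Lasso penalty on synthetic data in which only two of five features matter. The grid values and variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 5))
# Only the first two features drive the target
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

search = GridSearchCV(
    Lasso(max_iter=5000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_model = search.best_estimator_  # already refit on all rows with best_alpha
print(best_alpha, best_model.coef_)
```

Grid search tries every value in the grid and scores each with 5-fold cross-validation, so the chosen alpha reflects performance on held-out folds rather than on the training rows themselves.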

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load the dataset
# Adjust the path if 'Housing.csv' is not in the working directory
df = pd.read_csv('Housing.csv')

# Assuming 'price' is the target variable
X = df.drop(columns=['price'])
y = df['price']

# Perform one-hot encoding for categorical variables
X_encoded = pd.get_dummies(X)

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Train a decision tree regressor (price is continuous, so a classifier is not appropriate)
# max_depth=3 is an illustrative choice that keeps the printed questions readable
tree_reg = DecisionTreeRegressor(max_depth=3, random_state=42)
tree_reg.fit(X_train, y_train)
print("Test R2:", r2_score(y_test, tree_reg.predict(X_test)))

# Extract the tree structure
tree_structure = tree_reg.tree_

# Helper function to generate questions
def generate_questions(node):
    questions = []
    if tree_structure.children_left[node] != tree_structure.children_right[node]:
        # Internal node: samples with feature <= threshold go to the left child
        feature_index = tree_structure.feature[node]
        threshold = tree_structure.threshold[node]
        questions.append(f"Is the {X_encoded.columns[feature_index]} <= {threshold:.2f}?")
        questions.extend(generate_questions(tree_structure.children_left[node]))
        questions.extend(generate_questions(tree_structure.children_right[node]))
    else:
        # Leaf node: the stored value is the mean price of the training samples in the leaf
        questions.append(f"Predicted price: {tree_structure.value[node][0][0]:.0f}")
    return questions

# Generate questions starting from the root node (index 0)
tree_questions = generate_questions(0)

# Print the generated questions
print("Decision Tree Questions:")
for question in tree_questions:
    print(question)
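
scikit-learn also ships a built-in text export for fitted trees, which can serve as a cross-check for a hand-written traversal. The sketch below fits a small regression tree on synthetic data; the column names and coefficients are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "area": rng.uniform(2000, 9000, 100),
    "bathrooms": rng.integers(1, 4, 100),
})
y_demo = 700 * X_demo["area"] + 800_000 * X_demo["bathrooms"]

# A shallow tree keeps the printed rules short
tree_model = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_model.fit(X_demo, y_demo)

# Each line is either a split rule (feature <= threshold) or a leaf value
rules = export_text(tree_model, feature_names=list(X_demo.columns))
print(rules)
```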





### ML Model - 3


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Generate synthetic data
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, n_clusters_per_class=1, n_redundant=0, random_state=42)

# Train SVM classifiers with different kernels
# (illustration on synthetic data; the housing model in the next cell uses SVR)
models = {
    'SVM with Linear Kernel': SVC(kernel='linear'),
    'SVM with Polynomial Kernel': SVC(kernel='poly', degree=3),
    'SVM with RBF Kernel': SVC(kernel='rbf', gamma='auto'),
}
for model in models.values():
    model.fit(X, y)

# Grid of points covering the feature space, used to draw the decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

plt.figure(figsize=(18, 6))

# Plot the decision boundary of each kernel in its own panel
for i, (title, model) in enumerate(models.items(), start=1):
    plt.subplot(1, 3, i)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.show()


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Assuming 'price' is the target variable
X = df.drop(columns=['price'])
y = df['price']

# Perform one-hot encoding for categorical variables
cat_cols = X.select_dtypes(include=['object']).columns
encoder = OneHotEncoder()
X_encoded = pd.DataFrame(encoder.fit_transform(X[cat_cols]).toarray(), columns=encoder.get_feature_names_out(cat_cols), index=X.index)
# Append the numeric columns unchanged
X_encoded[X.select_dtypes(exclude=['object']).columns] = X.select_dtypes(exclude=['object'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the SVM regressor
svm = SVR()

# Perform grid search for hyperparameter tuning with KFold
# (StratifiedKFold is meant for class labels; price is a continuous target)
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid']}
cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(svm, param_grid, cv=cv, scoring='r2')
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters and the best estimator
best_params = grid_search.best_params_
best_svm = grid_search.best_estimator_

# Make predictions on the test data
y_pred = best_svm.predict(X_test_scaled)

# Calculate the R-squared (R2) score
r2 = r2_score(y_test, y_pred)

# Print the best parameters and R-squared (R2) score
print("Best Parameters:", best_params)
print("R-squared (R2) score:", r2)
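
SVR is sensitive to the scale of the target as well as the features, because its `C` and `epsilon` parameters are expressed in target units. The sketch below is one possible remedy, not part of the original workflow: it wraps the regressor in scikit-learn's `TransformedTargetRegressor` so the target is standardized for fitting and predictions are mapped back automatically. The data is synthetic and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 3))
# Synthetic target on a price-like scale (millions)
y_demo = (
    5_000_000
    + 1_000_000 * X_demo[:, 0]
    + 500_000 * X_demo[:, 1]
    + rng.normal(scale=100_000, size=200)
)
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# SVR fitted directly on the raw target
plain_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10))

# Same model, but the target is standardized during fitting and
# predictions are converted back to the original units
scaled_target_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10)),
    transformer=StandardScaler(),
)

plain_r2 = plain_svr.fit(X_tr, y_tr).score(X_te, y_te)
scaled_r2 = scaled_target_svr.fit(X_tr, y_tr).score(X_te, y_te)
print(plain_r2, scaled_r2)
```

If the grid search above reports a low or negative R2, comparing it against a target-scaled variant like this one is a quick way to check whether target scale is the cause.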

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***