##### README: About This Homework
 **Airbnb House Price Prediction Project** In this project, we will develop a machine learning model to predict the prices of houses listed on Airbnb. 
#### Data Information: 
 * **Data Set:** House listings collected from Airbnb 
 * **Variables:** Price, number of rooms, number of bathrooms, location, number of guests, number of reviews, rating, house type, etc.
#### EDA & Data Preprocessing:  
* We will explore the data and process missing data. 
* We will visualise the data and examine relationships and trends. 
* We will apply data normalisation and transformation if necessary.
#### Feature Engineering:  
* We will try to improve the performance of our model by deriving new features. 
* For example, we can add neighbourhood features using location information. 
#### Model Training:
* We will train different machine learning models and select the best performing model.
* We will optimise the model hyperparameters. 
#### Model Evaluation: 
* We will evaluate our model on test data and measure the prediction accuracy. 
* We will analyse the performance of the model using different metrics.
 
 
**Table of Contents**
1. **Data Information:**
2. **EDA & Data Preprocessing:**
3. **Feature Engineering:**
4. **Model Training:**
5. **Model Evaluation:**


# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor


: 

# Data Information



**Additional Features**
We reviewed the data published on the Inside Airbnb website and included additional listing data from it, so that we could check whether extra features can be extracted. For more information, you can refer to this website: [https://insideairbnb.com/get-the-data](https://insideairbnb.com/get-the-data)

**Data Dictionary**

- 'id': Unique identifier for the listing.
- 'host_id': Unique identifier for the host.
- 'host_name': Name of the host.
- 'host_about': Description or bio of the host.
- 'host_response_time': Time taken by the host to respond to inquiries.
- 'host_response_rate': Percentage of inquiries to which the host responds.
- 'host_acceptance_rate': Percentage of booking requests accepted by the host.
- 'host_verifications': Types of verifications the host has undergone (e.g., email, phone, government ID, etc.).
- 'neighbourhood_cleansed': The neighbourhood group as geocoded using the latitude and longitude against neighborhoods.
- 'property_type': Self-selected property type. Hotels and Bed and Breakfasts are described as such by their hosts in this field.
- 'room_type': The type of room available for booking.
- 'accommodates': The maximum capacity of the listing.
- 'bathrooms_text': The textual description of the number of bathrooms in the listing.
- 'beds': The number of bed(s) in the listing.
- 'price': The daily price in the local currency.
- 'number_of_reviews': The number of reviews the listing has.
- 'review_scores_rating': The overall rating score given by guests in reviews.
- 'review_scores_accuracy': The rating score for accuracy given by guests in reviews.
- 'review_scores_cleanliness': The rating score for cleanliness given by guests in reviews.
- 'review_scores_checkin': The rating score for check-in experience given by guests in reviews.
- 'review_scores_communication': The rating score for communication given by guests in reviews.
- 'review_scores_location': The rating score for location given by guests in reviews.
- 'review_scores_value': The rating score for value given by guests in reviews.

**In Extra Data**
- 'latitude': Uses the World Geodetic System (WGS84) projection for latitude and longitude.
- 'longitude': Uses the World Geodetic System (WGS84) projection for latitude and longitude.
- 'availability_365': The availability of the listing 365 days in the future, as determined by the calendar (the general form of this field is availability_x, for x days ahead). Note that a listing may be unavailable because it has been booked by a guest or blocked by the host.



## Exploring Airbnb House Price Prediction Data

In this section, we will explore the data used for our Airbnb house price prediction project.

- The `data.head()` command shows the first 5 rows of the dataset. This allows us to see the general structure of the dataset and which columns are present.

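- The `data.info()` command provides more detailed information about the dataset, including:
  - The data type of each column
  - The number of non-null values in each column (from which the number of missing values follows)
  - The memory usage of the dataset

- The `data.describe()` command gives summary statistics (count, minimum, maximum, mean, standard deviation, and quartiles) for each numerical column.

- The `data.shape` attribute gives the dimensions (number of rows and columns) of the dataset.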


: 

In [None]:
data = pd.read_csv("data.csv")
data.head()

: 

In [None]:
data.info()
data.shape

: 

In [None]:
data.describe()

: 

In [None]:
extra_data = pd.read_csv("listings.csv")
extra_data.head()

: 

In [None]:
extra_data.info()
extra_data.shape

: 

# EDA & Data Preprocessing

## Data Processing and Feature Selection

First, we drop non-numerical ('object' dtype) columns as they may not be directly useful for price prediction.

Next, we calculate the correlations between the remaining columns and visualize a correlation matrix. We will focus on features that show high correlation with the price.

Based on the correlation matrix and domain expertise, we select a subset of features that could be useful. In the example, the following features are selected:
- `latitude`: Latitude - location information
- `longitude`: Longitude - location information
- `availability_365`: Availability of listings throughout the year
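
One way to focus on the features most correlated with price is to rank them by absolute correlation. The sketch below uses a small made-up table (the `listings` frame and all of its values are illustrative assumptions, not the project data):

```python
import pandas as pd

# Toy listings table; the values are invented for illustration only.
listings = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 200.0, 65.0],
    "accommodates": [1, 2, 4, 6, 2],
    "availability_365": [300, 200, 150, 30, 250],
    "room_type": ["Private", "Private", "Entire", "Entire", "Shared"],
})

# Keep numerical columns only, then take each column's correlation with price.
numeric = listings.select_dtypes(include="number")
corr_with_price = numeric.corr()["price"].drop("price")

# Order by absolute strength so strong negative correlations are not overlooked.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

Note that this only works if `price` is already numerical; a price stored as text (for example with a currency symbol) would be dropped together with the other `object` columns.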

### Merging Datasets

Now we can merge our main dataset (`data`) with this selected extra data (`extra`). This is usually done via a common column (e.g., listing ID) using 'join'-like functions.
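
A minimal sketch of such a join, assuming both tables share a listing `id` column. The frames `main_toy` and `extra_toy` and their values are hypothetical stand-ins for `data` and `extra`:

```python
import pandas as pd

# Hypothetical stand-ins for the main dataset and the extra data.
main_toy = pd.DataFrame({"id": [101, 102, 103], "price": [50.0, 80.0, 120.0]})
extra_toy = pd.DataFrame({
    "id": [103, 101, 104],               # different order, one unmatched id
    "availability_365": [150, 300, 10],
})

# A left join keeps every row of the main table; listings without a match get NaN.
# validate="one_to_one" raises an error if an id is duplicated on either side.
merged = main_toy.merge(extra_toy, on="id", how="left", validate="one_to_one")
print(merged)
```

Joining on a key does not depend on the two files having the same row order, which a plain column assignment does.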


In [None]:

# First, we drop non-numerical ('object' dtype) columns. These columns may not be directly useful for price prediction.
object_columns = extra_data.select_dtypes(include=["object"]).columns
extra_data = extra_data.drop(columns=object_columns)

# Pairwise correlations between the remaining numerical columns
corr_matrix = extra_data.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

: 

In [None]:
# Features that could be useful
extra = extra_data[['latitude', 'longitude', 'availability_365']]
extra

: 

In [None]:
index = [
    "latitude", "longitude", "availability_365"]
for column in index:
    print(extra[column].value_counts(), "\n---------------------------------------------------------------")


: 

Of the three features, `availability_365` looks the most useful.

##### Data Cleaning and Preprocessing

The purpose of the code is to clean and preprocess specific columns in a dataset (`data`) using regular expressions (`re` module) and lambda functions. It performs the following operations:

- Extracts the numerical value from the "bathrooms_text" column and assigns it to the same column.
- Converts the "price" column values to float by removing dollar signs, commas, and whitespace.
- Converts the "host_response_rate" column values to float by removing percentage signs, commas, and whitespace.
- Converts the "host_acceptance_rate" column values to float by removing percentage signs, commas, and whitespace.


In [None]:
import re
def first_number(text):
    # Return the first number in the text as a float ("1.5 baths" -> 1.5), or None if there is no number
    match = re.search(r'\d+(?:\.\d+)?', str(text))
    return float(match.group()) if match else None
data["bathrooms_text"] = data["bathrooms_text"].apply(first_number)
# Strip currency/percentage signs, commas, and whitespace, then convert to float (missing values stay NaN)
data["price"] = data["price"].apply(lambda text: float(str(text).replace("$","").replace(",","").strip()))
data["host_response_rate"] = data["host_response_rate"].apply(lambda text: float(str(text).replace("%","").replace(",","").strip()))
data["host_acceptance_rate"] = data["host_acceptance_rate"].apply(lambda text: float(str(text).replace("%","").replace(",","").strip()))

: 

In [None]:
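# Work on a copy so that the original `data` stays unchanged
df = data.copy()
# Column assignment aligns on the row index, so this assumes `data` and `extra_data` list the same listings in the same order
df["availability_365"] = extra_data["availability_365"]
df.head()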

: 

In [None]:
df.dtypes

: 

In [None]:
# Distribution of Price

print(df["price"].value_counts())
plt.figure(figsize=(10, 8))
sns.kdeplot(df["price"], fill=True, color="b")
plt.xlabel('Price')
plt.ylabel('Density')
plt.title('Price Distribution (KDE)')
plt.show()


: 


#### Numerical Features

In [None]:
list(set(df.dtypes.tolist()))

: 

In [None]:
df_num = df.select_dtypes(include = ['float64', 'int64'])
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

: 

In [None]:

## Numerical variables are usually of 2 types:
## continuous variables and discrete variables

discrete_feature=[feature for feature in df_num if len(df[feature].unique())<25 ]
print(discrete_feature)
print("Discrete Variables Count: {}".format(len(discrete_feature)))

: 

In [None]:
## Find the relationship between discrete variables and price

for feature in discrete_feature:
    # Median price for each value of the discrete feature
    df.groupby(feature)['price'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('Price')
    plt.title(feature)
    plt.show()

: 

#### Continuous Variables

In [None]:
continuous_feature=[feature for feature in df_num if feature not in discrete_feature]
print("Continuous feature Count {}".format(len(continuous_feature)))

: 

In [None]:
## Analyse the continuous values by creating histograms to understand the distribution

for feature in continuous_feature:
    # Plot directly from df; no copy is needed because nothing is modified
    df[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

: 

#### Categorical Variables


In [None]:
df_cat = [col for col in df.columns if df[col].dtype == 'object']
df_cat

: 

In [None]:

features = ['room_type', 'neighbourhood_cleansed']


for feature in features:
    counts = df[feature].value_counts()
    plt.figure(figsize=(10, 6))
    sns.barplot(x=counts.index, y=counts.values)
    plt.title(feature)
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=90) 
    plt.show()

: 

In [None]:
sns.set(style='darkgrid')
plt.figure(figsize=(8, 8))
sns.countplot(y='property_type', data=df, order=df['property_type'].value_counts().index, palette='pastel')
plt.ylabel('Property Type', fontsize=20, weight='bold', color='black')
plt.show() 

: 

In [None]:
features = ['room_type', 'neighbourhood_cleansed']

# Mean price per category
for feature in features:
    plt.figure(figsize=(10, 6))
    sns.barplot(x=feature, y='price', data=df, estimator=np.mean)
    plt.title(feature + ' and Price Relationship')
    plt.xlabel(feature)
    plt.ylabel('Price')
    plt.xticks(rotation=90)
    plt.show()

: 

In [None]:
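# Median price per property type (plotting every row would draw overlapping bars at the same x position)
price_by_type = df.groupby('property_type')['price'].median().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(price_by_type.index, price_by_type.values, color='skyblue')

plt.xlabel('Property Type', fontsize=12)
plt.ylabel('Median Price ($)', fontsize=12)
plt.title('Median Price by Property Type', fontsize=14)
plt.xticks(rotation=90, ha='right', fontsize=10)

# Leave room for the rotated tick labels
plt.subplots_adjust(bottom=0.3)

plt.show()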

: 

# Feature Engineering

In [None]:
useless_features= ['id', 'host_id', 'host_name', 'host_about', 'host_verifications']
df_main = df.drop(columns=[col for col in df.columns if col in useless_features])
df_main.head()

: 

#### Handling Missing Values

1. **Investigating the Quantity and Distribution of Missing Values:**
   - Calculate the number and percentage of missing values for each feature.
   - Analyze whether missing values are randomly distributed or follow a specific pattern.

2. **Determining the Reasons for Missing Values:**
   - Try to understand the reasons for missing values, such as data collection errors, incorrect data entry, or intentionally left blank fields.

3. **Choosing Methods to Handle Missing Values:**
   - **Delete Missing Values:** This method can be used when there are few missing values and they are randomly distributed.
   - **Impute Missing Values:** Missing values can be estimated using methods such as mean, median, hot-deck imputation, etc.
   - **Define Missing Values as a Category:** If missing values are believed to have a unique meaning, they can be defined as a new category.
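
The steps above can be sketched on a small made-up table. The `toy` frame and its values are illustrative assumptions; only the column names are borrowed from the data dictionary:

```python
import numpy as np
import pandas as pd

# Invented example data with gaps in one numerical and one categorical column.
toy = pd.DataFrame({
    "beds": [1.0, 2.0, np.nan, 4.0],
    "host_response_time": ["within an hour", None, "within a day", None],
    "price": [50.0, 80.0, 120.0, 200.0],
})

# Step 1: number and percentage of missing values per feature.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(summary)

# Step 3, option 1: delete rows where a given feature is missing.
dropped = toy.dropna(subset=["beds"])

# Step 3, option 2: impute a numerical feature with its median.
imputed = toy.assign(beds=toy["beds"].fillna(toy["beds"].median()))

# Step 3, option 3: treat missing values as their own category.
categorised = toy.assign(
    host_response_time=toy["host_response_time"].fillna("Missing")
)
```

Which option fits best depends on how many values are missing and why, which is what steps 1 and 2 are meant to establish.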


In [None]:
df_main.isnull().sum()


: 

### Visualizing Impact of Missing Values on Price

Create bar charts comparing the median prices of observations containing missing values for each feature with observations that do not contain missing values. This way, you can visually see the impact of missing values on price.
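
A minimal sketch of this comparison for a single feature, using invented values. The helper `median_price_by_missingness` is a hypothetical name introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Invented example: two listings have no rating.
toy = pd.DataFrame({
    "review_scores_rating": [4.8, np.nan, 4.5, np.nan, 4.9],
    "price": [100.0, 60.0, 120.0, 80.0, 150.0],
})

def median_price_by_missingness(frame, feature, target="price"):
    """Median of `target` for rows where `feature` is missing vs. present."""
    flag = frame[feature].isnull().map({True: "missing", False: "present"})
    return frame.groupby(flag)[target].median()

result = median_price_by_missingness(toy, "review_scores_rating")
print(result)
# In a notebook, result.plot.bar() would draw the two medians as a bar chart.
```

A clear gap between the two medians suggests that missingness itself carries information about price, which is an argument for keeping it as a feature or category rather than dropping those rows.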







In [None]:
features_with_na=[feature for feature in df_main.columns if df_main[feature].isnull().sum()>0]
for feature in features_with_na:
    # isnull().mean() is a fraction, so multiply by 100 to report a percentage
    print("{}: {}% missing values".format(feature,np.round(df_main[feature].isnull().mean()*100,2)))

: 

In [None]:
# For categorical variables
features_nan=[feature for feature in df_main.columns if df_main[feature].isnull().sum()>0 and df_main[feature].dtypes=='O']
def replace_cat_feature(dataset,features_nan):
    # Fill missing categorical values with an explicit 'Missing' category
    data=dataset.copy()
    data[features_nan]=data[features_nan].fillna('Missing')
    return data

df_main=replace_cat_feature(df_main,features_nan)

df_main[features_nan].isnull().sum()

: 

In [None]:
# For numerical variables
numerical_with_nan=[feature for feature in df_main.columns if df_main[feature].isnull().sum()>0 and df_main[feature].dtypes!='O']

for feature in numerical_with_nan:
    # Replace missing values with the column median
    median_value=df_main[feature].median()
    # Assign the result back instead of using inplace=True on a column selection
    df_main[feature]=df_main[feature].fillna(median_value)

df_main[numerical_with_nan].isnull().sum()
    

: 

In [None]:
df_main.isnull().sum()

: 

In [None]:
df_main.dtypes

: 

### Encoding Categorical Variables using Label Encoder

The provided code segment demonstrates how to use the `LabelEncoder` from `sklearn.preprocessing` to encode categorical variables in a DataFrame (`df_main`).
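
As a small illustration of what the encoder does, the sketch below encodes two made-up columns and keeps one fitted encoder per column so that the integer codes can be mapped back to the original labels. The `toy` frame and the `encoders` dictionary are assumptions added for this example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented example data.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "host_response_time": ["within an hour", "Missing", "within a day", "within an hour"],
})

# Keep a separate fitted encoder for each column.
encoders = {}
for column in ["room_type", "host_response_time"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)
print(list(encoders["room_type"].classes_))  # labels in sorted order; position = code

# Map the integer codes back to the original labels.
original = encoders["room_type"].inverse_transform(toy["room_type"])
```

The codes follow the sorted order of the labels, so they imply an ordering that the categories do not necessarily have. Tree-based models usually cope with this, whereas a linear model may read the codes as magnitudes; one-hot encoding is a common alternative in that case.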


In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for column in df_main.columns:
    if df_main[column].dtype == 'object': 
        df_main[column] = label_encoder.fit_transform(df_main[column])

: 

In [None]:
df_main

: 

#### Feature Scaling
######  Scaling Features using Min-Max Scaler

The provided code snippet demonstrates how to scale features in a DataFrame (`df_main`) using Min-Max scaling with `MinMaxScaler` from `sklearn.preprocessing`.
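
Min-Max scaling maps each feature to the range [0, 1] with x' = (x - min) / (max - min), computed per column. The sketch below checks this formula against `MinMaxScaler` on a small made-up array (`X_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented data: two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 500.0]])

# Manual min-max scaling, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(X_toy)
print(scaled)
```

In a train/test workflow the scaler is usually fitted on the training split only and then applied to the test split, so that the test data does not influence the minimum and maximum.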


In [None]:
feature_scale=[feature for feature in df_main.columns if feature not in ['price']]

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(df_main[feature_scale])

: 

In [None]:
scaler.transform(df_main[feature_scale])

: 

In [None]:
df_main = pd.concat([df_main[['price']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(df_main[feature_scale]), columns=feature_scale)],
                    axis=1)
df_main

: 

In [None]:
y = df_main["price"]
X = df_main.drop(columns=["price"])

: 

# Model Training

### Brief Information About Algorithms

**Linear Regression:**
Linear regression is a simple regression model that attempts to predict a target variable using a linear combination of input features. It is used to model relationships that can be expressed with straight lines.

**XGBoost (eXtreme Gradient Boosting):**
XGBoost is an optimised implementation of gradient boosting, an ensemble method that builds decision trees sequentially so that each new tree corrects the errors of the previous ones. It has been successful in many data science competitions and can be used for both regression and classification problems. One of its main advantages is its ability to model complex, non-linear relationships and interactions between features.

**Why Choose These Algorithms:**

- **Linear Regression:** We choose linear regression for its simplicity and interpretability. It works well when there is a linear relationship between features and the target variable, making it a good starting point for regression tasks.
  
- **XGBoost:** XGBoost is chosen for its excellent performance in handling complex relationships and high-dimensional datasets. It is particularly effective in capturing non-linear relationships and interactions between features, making it a powerful choice for regression and classification tasks in diverse datasets.

These algorithms are selected based on their strengths and suitability for different types of relationships and data complexities encountered in predictive modeling tasks.


In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

model_xgb = XGBRegressor(objective='reg:squarederror')
model_xgb.fit(X_train, y_train)

: 


# Evaluating Model Performance: Mean Squared Error (MSE)

To assess the performance of a regression model, the Mean Squared Error (MSE) metric is commonly utilized. MSE quantifies the average squared difference between predicted values and actual values. Lower MSE values indicate superior model performance.

The formula to compute MSE is as follows:

\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Here:
- \( n \) represents the number of samples
- \( y_i \) denotes the actual target value for the \( i \)th sample
- \( \hat{y}_i \) signifies the predicted target value for the \( i \)th sample

Once you have obtained the predicted values from your regression model and have the corresponding actual target values, you can use the above formula to calculate the MSE and gauge the model's accuracy.
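
As a worked example, the sketch below computes the MSE by hand for three made-up predictions and compares it with `mean_squared_error`. The arrays `y_true` and `y_hat` are invented for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented actual and predicted prices.
y_true = np.array([100.0, 150.0, 200.0])
y_hat = np.array([110.0, 140.0, 230.0])

# Errors are -10, 10, -30, so the squared errors are 100, 100, 900.
mse_manual = np.mean((y_true - y_hat) ** 2)

# The same quantity from scikit-learn.
mse_sklearn = mean_squared_error(y_true, y_hat)

# The square root (RMSE) is in the same unit as the target, here the price.
rmse = np.sqrt(mse_manual)
print(mse_manual, mse_sklearn, rmse)
```

Because the errors are squared, a single large miss (the 30 here) dominates the result, which is worth keeping in mind when a few listings have extreme prices.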




In [None]:
# Make predictions with the linear regression model
y_pred = model.predict(X_test)

# Evaluate the model on the test set
mse = mean_squared_error(y_test, y_pred)
print("Linear Regression Mean Squared Error:", mse)

: 

In [None]:
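# Make predictions with the XGBoost model
y_pred_xgb = model_xgb.predict(X_test)

# Use a separate variable so the linear regression result is not overwritten
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
print("XGBoost Mean Squared Error:", mse_xgb)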

: 