# Data Science
This part walks through the typical tasks of a data science project:
- **Exploratory data analysis**: Investigating and summarizing a data set to gain insights and formulate hypotheses. It uses statistical summaries and visualizations to understand patterns, trends, and relationships in the data.
- **Feature engineering**: Transforming raw data into features that can be used as inputs to machine learning algorithms. It involves extracting and transforming relevant information from the data to improve model performance.
- **Feature selection**: Selecting a subset of relevant features from a larger set to improve model performance. Common techniques include correlation analysis, mutual information, and regularization.
- **Splitting data**: Dividing a data set into two or more subsets, typically a training set and a testing set. The training set is used to fit the models, while the testing set is used to evaluate them.
- **Model selection**: Choosing the most appropriate machine learning model for the problem. Candidate models are compared by their performance on the same data set, and the best-performing one is selected.
- **Model validation**: Evaluating models with techniques such as cross-validation and holdout validation. Performance is measured on data that was not used for training, in order to detect overfitting and estimate how well the model generalizes.
- **Metrics**: Measures used to evaluate model performance. For classification they include accuracy, precision, recall, F1 score, and ROC AUC; for regression, as in this notebook, mean squared error (MSE) and root mean squared error (RMSE).
- **Report**: A document that summarizes the findings of the project. It typically describes the problem, the data, the methods used, the results obtained, and the conclusions drawn. It should be clear, concise, well organized, and understandable to a non-technical audience.

## Setup
Import the libraries and load the data.

In [None]:
import random
import os

import psycopg
import folium
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from plotnine import (ggplot, aes, geom_histogram, geom_col,
                      coord_flip, labs, theme_minimal)
from geopy.distance import great_circle as GRC
from sklearn.model_selection import train_test_split, RepeatedKFold
from sklearn.metrics import mean_squared_error, root_mean_squared_error
from xgboost import XGBRegressor

In [None]:
# Connection settings are read from environment variables
conn = psycopg.connect(
    dbname=os.environ.get("DB_NAME"),
    user=os.environ.get("DB_USER"),
    password=os.environ.get("DB_PASSWORD"),
    host=os.environ.get("DB_HOST"),
    port=os.environ.get("DB_PORT"),
)

In [None]:
data = pd.read_sql('SELECT * FROM vw_airbnb', con=conn)
conn.close()

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# All rows having at least one -1 in numeric columns
numeric_cols = data.select_dtypes(include=[np.number]).columns
neg1_mask = (data[numeric_cols] == -1)

rows_with_neg1 = data[neg1_mask.any(axis=1)]
rows_with_neg1.head()

In [None]:
data = data.replace(-1, np.nan)
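
As a small illustration of the placeholder handling above, the sketch below counts `-1` values per numeric column on a made-up frame and then replaces them with `NaN`. The column names and values are hypothetical and are not taken from `vw_airbnb`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; -1 marks a missing value, as in the dataset above
toy = pd.DataFrame({
    "bedrooms": [1, -1, 2, 3],
    "bathrooms": [1.0, 1.5, -1.0, -1.0],
    "city": ["Paris", "Paris", "Paris", "Paris"],
})

# Count the -1 placeholders in each numeric column
numeric_cols = toy.select_dtypes(include=[np.number]).columns
neg1_counts = (toy[numeric_cols] == -1).sum()

# Replace the placeholder with NaN so pandas treats it as missing
toy_clean = toy.replace(-1, np.nan)
missing_counts = toy_clean[numeric_cols].isnull().sum()
```

Comparing `neg1_counts` with `missing_counts` is a quick way to confirm that every placeholder was converted.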

## Exploratory data analysis

In [None]:
data.head()

In [None]:
data.shape

### Check missing values

In [None]:
# Share of missing values per column
missing_dict = {col: data[col].isnull().mean() for col in data.columns}

In [None]:
missing_dict

In [None]:
data = data.dropna()

### Check individual variables
Make sure they make sense given what you know about the data.

In [None]:
data.iloc[0].center_latitude, data.iloc[0].center_longitude

*longitude* & *latitude*

In [None]:
# Checking longitude and latitude values (are they all in Paris?).
center = [data.iloc[0].center_latitude, data.iloc[0].center_longitude]

# Create the map (named listing_map to avoid shadowing the built-in map)
listing_map = folium.Map(location=center, zoom_start=12)

# Plot a random sample of at most 1000 listings
sample_rows = data.sample(n=min(1000, len(data)), random_state=42)

# Always include the listings with extreme coordinates.
# idxmin/idxmax return index labels, so they must be looked up with .loc
min_max_idx = [data['latitude'].idxmin(), data['longitude'].idxmin(),
               data['latitude'].idxmax(), data['longitude'].idxmax()]
rows_to_plot = pd.concat([sample_rows, data.loc[min_max_idx]])

for _, row in rows_to_plot.iterrows():
    location = [row['latitude'], row['longitude']]
    folium.Marker(location, popup=f'Price: {row["price"]}').add_to(listing_map)

In [None]:
listing_map

*price*

In [None]:
data["price"]

In [None]:
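# Problem with price: it's a string.
# Strip the currency symbol (and thousands separators, in case they occur) before converting.
data["price"] = data["price"].apply(lambda x: float(x.replace("$", "").replace(",", "")))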
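
A slightly more defensive variant of the conversion above is sketched below. It assumes prices may contain a currency symbol, thousands separators, or unparseable text; the sample strings are made up for illustration and may not match the real column.

```python
import pandas as pd

# Hypothetical raw price strings; the real column may be formatted differently
raw_price = pd.Series(["$85.00", "$1,200.00", "-1", "n/a"])

# Drop everything except digits, the decimal point, and the minus sign
cleaned = raw_price.str.replace(r"[^0-9.\-]", "", regex=True)

# errors="coerce" turns unparseable entries into NaN instead of raising
price = pd.to_numeric(cleaned, errors="coerce")
```

With this approach malformed entries become `NaN` and can be handled together with the other missing values, rather than stopping the notebook with a `ValueError`.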

In [None]:
data["price"].nlargest(n=10)

In [None]:
data["price"].nsmallest(n=10)

In [None]:
# Missing values were encoded as -1; now that price is numeric, replace any remaining -1 and drop those rows
data = data.replace(-1, np.nan)
data = data.dropna()

In [None]:
# Checking the price histogram
(
    ggplot(data, aes(x="price"))
    + geom_histogram()
)

In [None]:
data['price'].value_counts().iloc[:15]

In [None]:
(
    ggplot(data, aes(x="price"))
    + geom_histogram(binwidth=15, fill="steelblue", color="white")
    + labs(title="Listing price distribution", x="Price (USD)", y="Count")
    + theme_minimal()
)

*minimum_nights*

In [None]:
data["minimum_nights"].nlargest(n=10)

In [None]:
data["minimum_nights"].nsmallest(n=10)

In [None]:
data['minimum_nights'].value_counts().iloc[:30]

*city_name*

In [None]:
data["city_name"].nunique()

*room_type_name*

In [None]:
data["room_type_name"].unique()

In [None]:
data["room_type_name"].value_counts()

*neighbourhood_name*

In [None]:
data["neighbourhood_name"].unique()

In [None]:
data["neighbourhood_name"].value_counts()

*amenities*

In [None]:
am = set()

In [None]:
# Collect the distinct amenities across all listings
for amenities in data["amenities"]:
    for amenity in amenities.split(","):
        am.add(amenity.strip())

In [None]:
am
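
Beyond the set of distinct values, it can help to know how often each amenity occurs. The sketch below counts occurrences with `collections.Counter`; the sample strings are hypothetical and only mimic the comma-separated format used above.

```python
from collections import Counter

import pandas as pd

# Hypothetical amenities column: one comma-separated string per listing
amenities = pd.Series(["Wifi,Kitchen,Heating", "Wifi, Heating", "Kitchen,Wifi"])

# Split each string, strip stray whitespace, and count occurrences
counts = Counter(
    item.strip()
    for row in amenities
    for item in row.split(",")
    if item.strip()
)

most_common = counts.most_common(2)
```

The same pattern should apply to the `features` column, since it is split on commas in the same way.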

*features*

In [None]:
fe = set()

In [None]:
# Collect the distinct features across all listings
for features in data["features"]:
    for feature in features.split(","):
        fe.add(feature.strip())

In [None]:
fe

*cancel_policy_name*

In [None]:
data["cancel_policy_name"].value_counts()

*bed_type_name*

In [None]:
data["bed_type_name"].value_counts()

*property_type_name*

In [None]:
data["property_type_name"].value_counts()

## Feature engineering

In [None]:
data.head()

In [None]:
# Signed offsets from the city center, computed column-wise
data["longitude_to_center"] = data["longitude"] - data["center_longitude"]
data["latitude_to_center"] = data["latitude"] - data["center_latitude"]

In [None]:
data["distance_to_center"] = data[["longitude", 
                                    "latitude", 
                                    "center_longitude", 
                                    "center_latitude"]].apply(lambda x: GRC((x["latitude"], x["longitude"]), 
                                                                            (x["center_latitude"], x["center_longitude"])).km, axis=1)
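
The row-wise `apply` above can be slow on large frames. As a possible alternative, the sketch below computes the great-circle distance for whole arrays at once with the haversine formula. The function name `haversine_km`, the Earth radius constant, and the sample coordinates are assumptions of this sketch; results should be close to, but not necessarily identical to, those from `geopy`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical coordinates: two listings and an assumed city center
lat = np.array([48.8584, 48.8867])
lon = np.array([2.2945, 2.3431])
dist = haversine_km(lat, lon, 48.8566, 2.3522)
```

Because the function only uses NumPy ufuncs, it should also accept pandas columns such as `data["latitude"]` and `data["center_latitude"]` directly.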

In [None]:
neighbourhood_dummies = pd.get_dummies(data["neighbourhood_name"], drop_first=True)
room_type_dummies = pd.get_dummies(data["room_type_name"], drop_first=True)
bed_type_dummies = pd.get_dummies(data["bed_type_name"])
property_type_dummies = pd.get_dummies(data["property_type_name"])
cancellation_policy_dummies = pd.get_dummies(data["cancel_policy_name"], drop_first=True)

In [None]:
cancellation_policy_dummies

In [None]:
cancellation_policy_dummies["strict"] = (cancellation_policy_dummies["strict"] | 
                                        cancellation_policy_dummies["super_strict_30"] |  
                                        cancellation_policy_dummies["super_strict_60"])
cancellation_policy_dummies = cancellation_policy_dummies.drop(columns=["super_strict_60", "super_strict_30"])

In [None]:
bed_type_dummies

In [None]:
bed_type_dummies = bed_type_dummies.drop(columns=["Couch", "Futon", "Airbed"])
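
Dropping the dummy columns of rare categories by name, as above, leaves those listings as all-zero rows. The sketch below reaches the same result with a frequency threshold instead of hard-coded names. The sample values and the `min_count` threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical bed types; the threshold below is an arbitrary choice
bed_type = pd.Series(
    ["Real Bed"] * 6 + ["Pull-out Sofa"] * 3 + ["Futon", "Airbed"],
    name="bed_type_name",
)

min_count = 3  # categories seen fewer times than this are treated as rare
counts = bed_type.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories and map rare ones to a shared "Other" label
collapsed = bed_type.where(~bed_type.isin(rare), "Other")

# Dropping the "Other" column leaves rare listings as all-zero rows
dummies = pd.get_dummies(collapsed).drop(columns=["Other"])
```

A threshold keeps working when new rare categories appear in the data, whereas a fixed list of names has to be updated by hand.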

In [None]:
property_type_dummies = property_type_dummies.drop(columns=["Boutique hotel", "Townhouse", "Guesthouse", "Dorm",
                                                            "Hostel", "Boat", "Serviced apartment", "Cabin", "Villa",
                                                            "Timeshare", "Earth House", "Camper/RV", "Cave", "Other",
                                                            "Bungalow", "Igloo", "Treehouse", "Tipi", "Chalet"])

In [None]:
data = data.join(neighbourhood_dummies)
data = data.join(room_type_dummies)
data = data.join(bed_type_dummies)
data = data.join(property_type_dummies)
data = data.join(cancellation_policy_dummies)

In [None]:
data.head()

## Feature selection

In [None]:
data

In [None]:
df = data.drop(["neighbourhood_name", "longitude", "latitude", 'listing_given_id',
                "property_type_name", "center_longitude", "center_latitude", 
                "price", "minimum_nights", "room_type_name", "bed_type_name",
                "cancel_policy_name", "features", "amenities", "city_name"], axis=1)
target = data["price"]

In [None]:
df.columns

In [None]:
df.head()

## Splitting data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.20, random_state=42)

## Model Selection

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

In [None]:
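# Option A: scale all numeric features
numeric_columns = ["accommodates", "bathrooms", "bedrooms",
                   "beds", "longitude_to_center",
                   "latitude_to_center", "distance_to_center"]

In [None]:
# Option B: scale only the location features (overrides option A when this cell is run)
numeric_columns = ["longitude_to_center",
                   "latitude_to_center", "distance_to_center"]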

In [None]:
scaler = StandardScaler().fit(X_train[numeric_columns])

In [None]:
X_train_scaled = scaler.transform(X_train[numeric_columns])
X_test_scaled = scaler.transform(X_test[numeric_columns])

In [None]:
X_train = X_train.drop(columns=numeric_columns)
X_test = X_test.drop(columns=numeric_columns)

In [None]:
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

X_train = pd.concat([X_train, pd.DataFrame(X_train_scaled, columns=numeric_columns)], axis=1)
X_test = pd.concat([X_test, pd.DataFrame(X_test_scaled, columns=numeric_columns)], axis=1)

In [None]:
# Choose one model: run only one of the next three cells
model = LinearRegression()

In [None]:
model = XGBRegressor()

In [None]:
model = MLPRegressor(random_state=42, early_stopping=True, verbose=True)

In [None]:
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)

In [None]:
mean_squared_error(y_test, preds)

In [None]:
root_mean_squared_error(y_test, preds) # Root mean squared error (RMSE), in the same units as price
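
To make the relationship between the two metrics above concrete, the sketch below computes them by hand with NumPy. The prices and predictions are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical prices and predictions, for illustration only
y_true = np.array([100.0, 150.0, 80.0, 120.0])
y_pred = np.array([110.0, 140.0, 70.0, 150.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)   # mean squared error, in squared price units
rmse = np.sqrt(mse)          # root mean squared error, in price units
```

RMSE is the square root of MSE, which is why it can be read on the same scale as the listing prices.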

In [None]:
importances = model.feature_importances_ # Only for XGBoost model
importances

In [None]:
features = X_train.columns

fi = (pd.DataFrame({'feature': features, 'importance': importances})
        .sort_values('importance', ascending=False))

fi.head(10)

In [None]:
(
    ggplot(fi, aes(x='reorder(feature, importance)', y='importance'))
    + geom_col(fill='steelblue')
    + coord_flip()
    + labs(title='XGBoost Feature Importances', x='Feature', y='Importance')
    + theme_minimal()
)
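
Since `feature_importances_` is only available for the XGBoost model here, permutation importance is one model-agnostic alternative that also works with `LinearRegression` or `MLPRegressor`. The sketch below uses synthetic data and a linear model purely for illustration; it is not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends strongly on x0, weakly on x1, not on x2
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
y = 5.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Shuffle one column at a time and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)

fi = (pd.DataFrame({"feature": X.columns, "importance": result.importances_mean})
        .sort_values("importance", ascending=False))
```

In practice the importances would usually be computed on held-out data such as `X_test` and `y_test`, so that they reflect generalization rather than fit to the training set.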

## Model Validation

In [None]:
#TODO
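
One way this step could look is repeated k-fold cross-validation, using the `RepeatedKFold` class imported in the setup. The sketch below runs on synthetic data with a linear model; the data, fold counts, and model choice are assumptions for illustration, not results from the Airbnb data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data standing in for the real features and prices
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

# 5 folds repeated 3 times gives 15 train/validation splits
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

# scikit-learn maximizes scores, so RMSE is exposed with a negative sign
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_split = -scores
```

The mean and spread of `rmse_per_split` give a more stable performance estimate than a single train/test split. With scaled features, the scaler would ideally be fit inside each fold (for example via a scikit-learn `Pipeline`) to avoid leaking validation data into training.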

## Report

In [None]:
#TODO