# Airbnb Data Preparation 🏡

---

![](https://www.kent.ac.uk/events/images/paris-notre-dame-1920x1280.jpg)

---

In this exercise, you are on your own! You will work with data from Paris Airbnb listings (open-sourced here: http://insideairbnb.com/get-the-data.html). You should be familiar with this data, as we explored it in the data visualization session.

Your mission is to perform a **complete data cleaning & preparation**, and then **to fit a regression model that predicts the listing price**!

## Guidelines

You are free to perform the analysis the way you want. However, we recommend that you explore and run your analysis in a Jupyter notebook, and present your conclusions in a structured notebook with text and titles to organize your results.

Below is a structure you can follow for this challenge. Feel free to explore on your own and to go off the beaten track.

### 1. EDA

1. Load the dataset (most likely into a Pandas DataFrame). You should already have downloaded the dataset during the *Data Visualization* challenge.
2. Describe briefly the data, explore important variables and possible relationships between variables.
3. Clean your data (missing data, N/A values, duplicated lines, outliers, data not properly loaded, etc.) and act on it.
4. Ask questions about your data and try answering them. Visualize your data and understand it.
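
Outlier handling in step 3 can take many forms. As one possible illustration, the sketch below applies the common 1.5 × IQR rule of thumb to a toy price series. The values are made up, and the 1.5 factor is a convention rather than a requirement of this challenge.

```python
import pandas as pd

# Toy prices (hypothetical values, not taken from the Airbnb dataset).
prices = pd.Series([50, 60, 65, 70, 80, 90, 1000], name="price")

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Keep values within 1.5 * IQR of the quartiles (a common rule of thumb).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = prices[prices.between(lower, upper)]

print(f"kept {len(filtered)} of {len(prices)} values")
```

Whether an extreme price is an error or a genuine luxury listing is a judgment call, so it is worth inspecting the flagged rows before discarding them.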

### 2. Features Preparation

1. Think about the features you want to select for your models, and prepare them if needed (categorical encoding, etc.).
2. Create new smart features.
3. Scale your features.
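
The three steps above could look roughly like the following sketch. It assumes a toy DataFrame whose column names mirror the Airbnb data but whose values are made up, and `guests_per_bathroom` is a hypothetical engineered feature chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy listings; column names mirror the Airbnb data but the values are made up.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "accommodates": [4, 2, 6, 1],
    "bathrooms": [1.0, 1.0, 2.0, 1.0],
})

# 1. Encode the categorical column as one indicator column per category.
features = pd.get_dummies(listings, columns=["room_type"], dtype=float)

# 2. Engineer a new feature (hypothetical): guests per bathroom.
features["guests_per_bathroom"] = features["accommodates"] / features["bathrooms"]

# 3. Standardize the numeric columns to zero mean and unit variance.
numeric_cols = ["accommodates", "bathrooms", "guests_per_bathroom"]
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])

print(features.round(2))
```

In a real workflow the scaler would normally be fitted on the training split only, so this all-in-one version is best read as a demonstration of the individual operations.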

### 3. Model Training

1. Split your dataset into training set and test set. You can also split into training, validation and testing set if appropriate (for optimizing hyperparameters for example or comparing performance of your models).
2. Choose a model that you think appropriate and train it on your data.
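
A three-way split can be obtained by calling `train_test_split` twice. The sketch below uses synthetic arrays in place of the prepared features and target, and the 60/20/20 proportions are an assumption, not a prescription.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# First split off 20% as the final test set...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ...then take 25% of the remainder as validation (0.25 * 0.8 = 0.2 of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```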

### 4. Conclusions

1. Evaluate your model, think about what metrics make more sense.
2. Iterate on your analysis: go back to your EDA and features and think about whether you can improve your data preparation.
3. Present your results in a clear and visual way.
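
For a price regression, common candidate metrics are the mean absolute error (MAE), the root mean squared error (RMSE) and R². The sketch below computes all three with scikit-learn on made-up numbers. Which one "makes more sense" depends on how much large errors should be penalized.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 260.0])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in price units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```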

---

## I - Exploration

#### Import

In [35]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns 


#### Loading the dataset

In [36]:
data = pd.read_csv("/home/wafa/Bureau/listings.csv")
data



Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,2577,https://www.airbnb.com/rooms/2577,20181207151406,2018-12-08,Loft for 4 by Canal Saint Martin,"100 m2 loft (1100 sq feet) with high ceiling, ...",The district has any service or shop you may d...,"100 m2 loft (1100 sq feet) with high ceiling, ...",none,,...,t,,{PARIS},t,f,strict_14_with_grace_period,f,f,1,
1,3109,https://www.airbnb.com/rooms/3109,20181207151406,2018-12-08,zen and calm,Appartement très calme de 50M2 Belle lumière D...,I bedroom appartment in Paris 14,I bedroom appartment in Paris 14,none,,...,t,,{PARIS},f,f,flexible,f,f,1,0.29
2,5396,https://www.airbnb.com/rooms/5396,20181207151406,2018-12-08,Explore the heart of old Paris,"Cozy, well-appointed and graciously designed s...","Small, well appointed studio apartment at the ...","Cozy, well-appointed and graciously designed s...",none,"You are within walking distance to the Louvre,...",...,t,,{PARIS},t,f,strict_14_with_grace_period,f,f,1,1.29
3,7397,https://www.airbnb.com/rooms/7397,20181207151406,2018-12-08,MARAIS - 2ROOMS APT - 2/4 PEOPLE,"VERY CONVENIENT, WITH THE BEST LOCATION !",PLEASE ASK ME BEFORE TO MAKE A REQUEST !!! No ...,"VERY CONVENIENT, WITH THE BEST LOCATION ! PLEA...",none,,...,t,7510400829623,{PARIS},f,f,moderate,f,f,1,2.47
4,7964,https://www.airbnb.com/rooms/7964,20181207151406,2018-12-08,Large & sunny flat with balcony !,Very large & nice apartment all for you! - Su...,hello ! We have a great 75 square meter apartm...,Very large & nice apartment all for you! - Su...,none,,...,t,,{PARIS},f,f,strict_14_with_grace_period,f,f,1,0.06
5,8522,https://www.airbnb.com/rooms/8522,20181207151406,2018-12-08,GREAT FLAT w/ CITY VIEW,,Really nice flat located in the 20th district ...,Really nice flat located in the 20th district ...,none,,...,t,,{PARIS},f,f,moderate,f,f,1,0.01
6,9359,https://www.airbnb.com/rooms/9359,20181207151406,2018-12-08,"Cozy, Central Paris: WALK or VELIB EVERYWHERE !",Location! Location! Location! Just bring your ...,"Since I live in the USA, it is difficult to ma...",Location! Location! Location! Just bring your ...,none,,...,t,,{PARIS},f,f,strict_14_with_grace_period,t,t,1,
7,9952,https://www.airbnb.com/rooms/9952,20181207151406,2018-12-08,Paris petit coin douillet,,Make your stay in Paris a perfect experience. ...,Make your stay in Paris a perfect experience. ...,none,,...,t,,{PARIS},f,f,strict_14_with_grace_period,f,f,1,0.24
8,10010,https://www.airbnb.com/rooms/10010,20181207151406,2018-12-08,Paris view from my balcony - B&B,"A real ""home from home"" in the heart of Paris ...",Enjoy a slice of Parisian history and stay in ...,"A real ""home from home"" in the heart of Paris ...",none,"This area typically Parisian, is nice, trendy ...",...,t,,{PARIS},f,f,strict_14_with_grace_period,f,f,6,1.33
9,10270,https://www.airbnb.com/rooms/10270,20181207151406,2018-12-08,Be charmed by real Paris 11th - B&B,Here you will find a true home from home in Pa...,Here you will find a true home from home so th...,Here you will find a true home from home in Pa...,none,"This area typically Parisian, is nice, trendy ...",...,t,,{PARIS},f,f,strict_14_with_grace_period,f,f,6,1.46


#### Describing the data

How big is your dataset?
What does it contain (columns and rows)? 
What type of variables does it contain (continuous, discrete)?
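
One possible way to answer these questions programmatically is sketched below on a toy frame (column names mirror the Airbnb data, values are made up). It uses `shape` for the size and `select_dtypes` to separate numeric columns from the rest.

```python
import pandas as pd

# Toy frame; column names mirror the Airbnb data but the values are made up.
toy = pd.DataFrame({
    "price": ["$75.00", "$120.00", "$60.00"],
    "accommodates": [2, 4, 1],
    "reviews_per_month": [1.29, None, 0.24],
    "instant_bookable": ["t", "f", "f"],
})

n_rows, n_cols = toy.shape
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)  # includes price, which is stored as text
```

Columns such as `price` that look numeric but are loaded as text show up in the non-numeric group, which is a useful hint that they need parsing.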

In [37]:
data.shape #(59881, 96)
data.duplicated().sum() #0
data.head(n=1)
#data.isnull().sum()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,2577,https://www.airbnb.com/rooms/2577,20181207151406,2018-12-08,Loft for 4 by Canal Saint Martin,"100 m2 loft (1100 sq feet) with high ceiling, ...",The district has any service or shop you may d...,"100 m2 loft (1100 sq feet) with high ceiling, ...",none,,...,t,,{PARIS},t,f,strict_14_with_grace_period,f,f,1,


Extract some first summary statistics and start understanding the distribution of your data (`data.describe()`, `data.info()`, `sns.pairplot(data)`, etc.).

In [38]:
#data.info()
#data.describe()


In [39]:
#sns.pairplot(data)

## II - Cleaning

Missing data, N/A values, duplicated lines, data not properly loaded

In [40]:
#data.info()

# Drop identifier, URL, free-text and mostly-empty columns.
data_cleaned = data.drop(
    columns=[
        "id", "weekly_price", "monthly_price", "square_feet",
        "host_acceptance_rate", "neighbourhood_group_cleansed", "notes",
        "medium_url", "thumbnail_url", "xl_picture_url",
        "summary", "space", "license",
    ]
)
#data_cleaned = data_cleaned.drop(['last_review'], axis=1)
#data_cleaned = data_cleaned.loc[data_cleaned['reviews_per_month'] < 10]
data_cleaned.dropna(subset=["zipcode","cleaning_fee","state","security_deposit","neighbourhood","jurisdiction_names","review_scores_accuracy","host_neighbourhood","interaction","house_rules","host_about","neighborhood_overview", "transit", "access", "host_response_time"], inplace=True)

#data_cleaned.isnull().sum()
#data_cleaned.shape #(47350, 15)
#data_cleaned['interaction']
data_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3846 entries, 2 to 59299
Data columns (total 83 columns):
listing_url                         3846 non-null object
scrape_id                           3846 non-null int64
last_scraped                        3846 non-null object
name                                3846 non-null object
description                         3846 non-null object
experiences_offered                 3846 non-null object
neighborhood_overview               3846 non-null object
transit                             3846 non-null object
access                              3846 non-null object
interaction                         3846 non-null object
house_rules                         3846 non-null object
picture_url                         3846 non-null object
host_id                             3846 non-null int64
host_url                            3846 non-null object
host_name                           3846 non-null object
host_since                          384
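
The `dropna` call above requires fifteen columns to be non-null at the same time, which leaves 3846 of the original 59881 rows. A possible alternative, sketched here on a toy frame with made-up values, is to rank columns by their share of missing values, drop the sparsest columns first, and only then drop incomplete rows. The 50% threshold is an arbitrary assumption.

```python
import pandas as pd

# Toy frame; column names mirror the Airbnb data but the values are made up.
toy = pd.DataFrame({
    "price": [75.0, 120.0, 60.0, 90.0],
    "host_about": [None, None, None, "Hello"],
    "cleaning_fee": [20.0, None, 15.0, 30.0],
})

# Share of missing values per column, highest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Drop columns that are mostly empty (the 50% threshold is an arbitrary choice)...
sparse_cols = missing_share[missing_share > 0.5].index
reduced = toy.drop(columns=sparse_cols)

# ...then drop only the rows that still contain gaps.
reduced = reduced.dropna()
print(reduced.shape)  # (3, 2), whereas toy.dropna() would keep a single row
```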

In [53]:
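# Strip the currency symbol and any thousands separators, then convert to float
# so that price can be used as a numeric regression target.
data_cleaned["price"] = data_cleaned["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)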
data_cleaned

Unnamed: 0,listing_url,scrape_id,last_scraped,name,description,experiences_offered,neighborhood_overview,transit,access,interaction,...,review_scores_value,requires_license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
2,https://www.airbnb.com/rooms/5396,20181207151406,2018-12-08,Explore the heart of old Paris,"Cozy, well-appointed and graciously designed s...",none,"You are within walking distance to the Louvre,...",The flat is close to two or three major metro ...,"The flat includes its own modern bathroom, a w...",We expect guests to operate rather independent...,...,10.0,t,{PARIS},t,f,strict_14_with_grace_period,f,f,1,1.29
8,https://www.airbnb.com/rooms/10010,20181207151406,2018-12-08,Paris view from my balcony - B&B,"A real ""home from home"" in the heart of Paris ...",none,"This area typically Parisian, is nice, trendy ...",I can also organize transport service.Easy acc...,Services and equipment : free wireless intern...,"I welcome you personally on your arrival, I gi...",...,10.0,t,{PARIS},f,f,strict_14_with_grace_period,f,f,6,1.33
9,https://www.airbnb.com/rooms/10270,20181207151406,2018-12-08,Be charmed by real Paris 11th - B&B,Here you will find a true home from home in Pa...,none,"This area typically Parisian, is nice, trendy ...",Easy access from the train stations and airpor...,The price includes : Breakfast provided to pre...,"I welcome you personally upon arrival, I give ...",...,10.0,t,{PARIS},f,f,strict_14_with_grace_period,f,f,6,1.46
24,https://www.airbnb.com/rooms/16455,20181207151406,2018-12-08,"Spacious, Light-Filled Apartment","Our apartment is spacious, bright, centrally-l...",none,"A huge park, some smaller ones (and a little g...",Super convenient public transportation. Two m...,"Guests can use the piano, but pianists who mak...",Of course guests can contact me if they have q...,...,8.0,t,{PARIS},f,f,strict_14_with_grace_period,f,f,2,0.30
25,https://www.airbnb.com/rooms/16457,20181207151406,2018-12-08,***ROOFTOP LE MARAIS***,"STUDIO @ Le Marais Central location, 100m from...",none,"Very lively area, plenty of bars, restaurants ...","Very central, metro station nearby. Line 1, ""S...",All the apartment.,Keys provided for your arrival. Deposit is man...,...,9.0,t,{PARIS},f,f,moderate,t,t,1,3.11
35,https://www.airbnb.com/rooms/19306,20181207151406,2018-12-08,PARIS ROMANTIC MAISONETTE,Charming large studio have a lovely bedroom ...,none,LOCATION!LOCATION !! LOCATION!! 3 MINUTES WALK...,A set of instructions will be sent with instru...,"Booking one of Quartier Latin Apartments, the...","As I don't live in the building, you can conta...",...,9.0,t,{PARIS},t,f,strict_14_with_grace_period,f,f,3,1.74
36,https://www.airbnb.com/rooms/20823,20181207151406,2018-12-08,TWO FLOORS apartment in Paris 5eme,!!AIR CONDITIONING & STORAGE FOR YOUR LUGGAGE ...,none,The MOUFFETARD neighborhood is very little kno...,A set of instructions will be sent with instru...,"Booking one of Quartier Latin Apartments, the...","As I don't live in the building, but you can c...",...,9.0,t,{PARIS},f,f,strict_14_with_grace_period,f,f,3,1.59
38,https://www.airbnb.com/rooms/21167,20181207151406,2018-12-08,A peaceful and very well located bedroom,"Located in the 18th district of Paris, in the ...",none,The area is very welcoming. 15 mn walking dis...,"Bus95 (crossing Paris from North to South, via...",Ktichen and bathroom.,I would be happy to help you to make you stay ...,...,9.0,t,{PARIS},t,f,strict_14_with_grace_period,f,f,1,0.67
39,https://www.airbnb.com/rooms/21194,20181207151406,2018-12-08,LOVELY STUDIO IN QUARTIER LATIN,This beautiful cute studio of 18m² on the grou...,none,The MOUFFETARD neighborhood is very little kno...,"The parking Patriarches, Paris 5eme is very n...",The entire studio.,je reste disponible pour mes voyageurs en cas ...,...,9.0,t,{PARIS},t,f,strict_14_with_grace_period,f,f,3,1.41
40,https://www.airbnb.com/rooms/21264,20181207151406,2018-12-08,"Paris holidays, with family or friends",A great space for four to five people - Parfai...,none,"Lots of light, calm. Shopping and restaurants ...",Closed to five metrostations and RER B Près de...,Free WiFi Equiped bathroom with towels Well eq...,"As much as possible (personal, phone, e-mail) ...",...,10.0,t,{PARIS},f,f,moderate,f,f,1,0.19


In [68]:
df = data_cleaned[["reviews_per_month","review_scores_rating", "host_listings_count", "accommodates", 'bathrooms','review_scores_accuracy']]
df.isnull().sum()

reviews_per_month         0
review_scores_rating      0
host_listings_count       0
accommodates              0
bathrooms                 0
review_scores_accuracy    0
dtype: int64

## III - Features Preparation

In [69]:
from sklearn.model_selection import train_test_split
from sklearn import svm

X_train, X_test, y_train, y_test = train_test_split(df, data_cleaned['price'],
                                                    test_size=0.2, 
                                                    random_state=0)



In [70]:
from sklearn.preprocessing import StandardScaler

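scaler = StandardScaler()
scaler.fit(df)
# NOTE: the scaled array is only displayed here, not stored, so the model below
# is still trained on the unscaled X_train. Fitting on the full df also uses
# statistics from the test rows.
scaler.transform(df)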



array([[-0.22372823,  0.11015702, -0.16670724, -0.7850019 , -0.11119047,
        -0.93416709],
       [-0.19852993,  0.86859583, -0.08435571,  0.89197494, -0.11119047,
         0.59925443],
       [-0.11663548,  0.56522031, -0.08435571,  0.33298266, -0.11119047,
         0.59925443],
       ...,
       [-0.40641587,  1.02028359, -0.16670724,  0.33298266, -0.11119047,
        -2.4675886 ],
       [-0.40641587,  1.02028359, -0.15023693, -0.7850019 , -0.11119047,
         0.59925443],
       [ 0.85349887,  1.02028359, -0.15023693, -0.7850019 , -0.11119047,
         0.59925443]])
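
To keep test-set information out of the preprocessing, the scaler is usually fitted on the training split only and then applied to both splits. A minimal sketch, using synthetic features in place of the selected columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the selected columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train)

# ...and apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # zero up to floating-point error
print(X_test_scaled.mean(axis=0).round(3))   # near zero, but not exactly
```

The scaled arrays, rather than the raw ones, are then what should be passed to the model.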

In [71]:
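# NOTE: SVC is a classifier, so every distinct price is treated as a class label
# and score() returns classification accuracy, not a regression metric.
# A regressor such as svm.SVR would match the stated goal of predicting price.
clf = svm.SVC()
clf.fit(X_train, y_train)
clf_score = clf.score(X_test, y_test)
print(clf_score)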



0.048051948051948054
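
Since the goal is to predict a continuous price, a regressor is a more natural fit than a classifier. The sketch below is one possible setup, not the notebook's actual result: it chains `StandardScaler` and `svm.SVR` in a pipeline and trains on synthetic data, with `kernel` and `C` chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data: a noisy linear target standing in for price.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 100 + 30 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=5.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training data only, then the regressor.
# kernel and C are illustrative choices, not tuned values.
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=10.0))
model.fit(X_train, y_train)

# For regressors, score() returns R^2 rather than accuracy.
r2 = model.score(X_test, y_test)
print(f"R^2 on the test set: {r2:.3f}")
```

On the real listings, the same pipeline would take the selected feature columns and the numeric `price` column in place of the synthetic arrays.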
