## Importing the libraries
* A (software) library is a collection of files (called modules) that contains functions for use by other programs.
* A library may also contain data values (e.g., numerical constants) and other things.
* A library’s contents are supposed to be related, but there’s no way to enforce that.
* The Python standard library is an extensive suite of modules that comes with Python itself.
* Many additional libraries are available from PyPI (the Python Package Index).
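
As a small illustration of the points above, the sketch below imports two standard-library modules and uses both a data value and a function from them. The choice of modules (`math`, `statistics`) and the variable names are only for illustration.

```python
import math          # standard-library module: functions plus constants
import statistics    # another standard-library module

# A library can expose data values (here, the constant pi)...
circle_area = math.pi * 2 ** 2

# ...as well as functions that other programs call.
mean_price = statistics.mean([80, 120, 100])

print(round(circle_area, 2))
print(mean_price)
```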

In [None]:
import os
import folium
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Importing the dataset
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
The pandas source code is available on [`GitHub`](https://github.com/pandas-dev/pandas).
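
The next cell splits the frame into features and a target with `iloc`. A minimal sketch of that pattern on a made-up three-row frame (the column names and values are invented for illustration, not taken from the Airbnb file):

```python
import pandas as pd

# Toy frame standing in for the real CSV; values are invented.
toy = pd.DataFrame({
    "price": [100, 250, 80],
    "minimum_nights": [1, 3, 2],
    "city": ["Austin", "Boston", "Chicago"],
})

# All rows, every column except the last -> feature matrix (NumPy array).
X_toy = toy.iloc[:, :-1].values
# All rows, only the last column -> target vector.
y_toy = toy.iloc[:, -1].values

print(X_toy.shape)
print(y_toy)
```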

In [None]:
dataset = pd.read_csv("/kaggle/input/us-airbnb-open-data/AB_US_2020.csv")
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

In [None]:
print(X)

In [None]:
print(y)

## Data Visualization
Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.

Python offers several great graphing libraries packed with different features. Whether you want to create interactive, live, or highly customized plots, Python has an excellent library for you.

For a quick overview, here are a few popular plotting libraries:
* Matplotlib: low level, provides lots of freedom
* Pandas Visualization: easy to use interface, built on Matplotlib
* Seaborn: high-level interface, great default styles
* ggplot: based on R’s ggplot2, uses Grammar of Graphics
* Plotly: can create interactive plots
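
To illustrate the second bullet, the sketch below checks that a pandas plot call hands back an ordinary Matplotlib `Axes`, so Matplotlib methods can still be used to customize it. The toy data is invented for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({"price": [100, 250, 80, 120]})

# pandas draws through Matplotlib and returns the Axes it drew on.
ax = toy["price"].plot(kind="hist", bins=4)

# Because it is a Matplotlib Axes, low-level customization still works.
ax.set_title("Toy price histogram")
ax.set_xlabel("price")

print(type(ax).__module__)
plt.close("all")  # free the figure; omit in a notebook to keep it displayed
```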

In [None]:
dataset.head()

In [None]:
dataset.tail()

In [None]:
dataset.describe()

In [None]:
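# Correlation is only defined for numeric columns, so select those first.
dataset.select_dtypes(include='number').corr()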

In [None]:
dataset.columns

In [None]:
dataset.shape

In [None]:
X.shape

In [None]:
y.shape

In [None]:
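plt.rcParams['figure.figsize'] = (10, 10)
# Correlation is only defined for numeric columns, so select those first.
g = sns.heatmap(dataset.select_dtypes(include='number').corr(), annot=True, fmt=".2f", cmap="coolwarm")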

## Working with Seaborn

Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired. There are several valid complaints about Matplotlib that often come up:

* Prior to version 2.0, Matplotlib's defaults were not exactly the best choices. They were based on MATLAB circa 1999, and this often shows.
* Matplotlib's API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a lot of boilerplate code.
* Matplotlib predates Pandas by several years, and thus was not designed for use with Pandas DataFrames. To visualize data from a Pandas DataFrame, you must often extract each Series and concatenate them into the right format. It would be nicer to have a plotting library that can intelligently use the DataFrame labels in a plot.

**An answer to these problems is Seaborn. [Seaborn](https://seaborn.pydata.org/api.html) provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.**
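
A minimal sketch of the DataFrame integration described above: seaborn takes the frame plus column names and labels the axes itself. The toy frame is invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

toy = pd.DataFrame({
    "price": [100, 250, 80, 120],
    "number_of_reviews": [12, 3, 40, 7],
})

# Pass the DataFrame and column names; no manual Series extraction needed.
ax = sns.scatterplot(data=toy, x="price", y="number_of_reviews")

# The axis labels come from the column names automatically.
print(ax.get_xlabel(), ax.get_ylabel())
plt.close("all")  # free the figure; omit in a notebook to keep it displayed
```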

To learn more about Seaborn, see this [**`Blog`**](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html) and the accompanying [**`Colab Notebook`**](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb).

For more visualization examples, see the [`w3schools tutorial`](https://www.w3schools.com/python/numpy_random_seaborn.asp).

In [None]:
sns.jointplot(x='id', y='host_id', data=dataset)

In [None]:
sns.jointplot(x='id', data=dataset, y='latitude')

In [None]:
sns.jointplot(x='id', data=dataset, y='longitude')

In [None]:
# sns.jointplot(x='id', data=dataset, y='price')

In [None]:
sns.jointplot(x='id', data=dataset, y='minimum_nights')

In [None]:
sns.jointplot(x='id', data=dataset, y='number_of_reviews')

In [None]:
sns.jointplot(x='id', data=dataset, y='calculated_host_listings_count')

In [None]:
sns.jointplot(x='id', data=dataset, y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='latitude')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='minimum_nights')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='price')

In [None]:
sns.jointplot(data=dataset, x='host_id', y='longitude')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='longitude')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='price')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='minimum_nights')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='latitude', y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='longitude', y='price')

In [None]:
sns.jointplot(data=dataset, x='longitude', y='minimum_nights')

In [None]:
sns.jointplot(data=dataset, x='longitude', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='longitude', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='longitude', y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='price', y='minimum_nights')

In [None]:
sns.jointplot(data=dataset, x='price', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='price', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='price', y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='minimum_nights', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='minimum_nights', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='minimum_nights', y='availability_365')

In [None]:
sns.jointplot(data=dataset, x='availability_365', y='number_of_reviews')

In [None]:
sns.jointplot(data=dataset, x='availability_365', y='calculated_host_listings_count')

In [None]:
sns.jointplot(data=dataset, x='number_of_reviews', y='calculated_host_listings_count')

## Interestingly, I found the Folium library, which provides map visualization
folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.

folium makes it easy to visualize data that’s been manipulated in Python on an interactive Leaflet map. It enables both the binding of data to a map for choropleth visualizations and the passing of rich vector/raster/HTML visualizations as markers on the map.

The library has a number of built-in tilesets from OpenStreetMap, Mapbox, and Stamen, and supports custom tilesets with Mapbox or Cloudmade API keys. folium supports Image, Video, GeoJSON and TopoJSON overlays.

For more information go to the [link](https://python-visualization.github.io/folium/)

In [None]:
MapModel = dataset[['latitude', 'longitude']]

## Using the clustering model and visualizer model

In [None]:
# Cluster the listings by location into 15 groups.
kmeans = KMeans(n_clusters=15, random_state=42).fit(MapModel)

# Draw one circle per cluster centre; the popup shows the cluster size.
cluster_map = folium.Map(location=[41.8781, -87.6298], zoom_start=4)
for i, (lat, lon) in enumerate(kmeans.cluster_centers_):
    num = int((kmeans.labels_ == i).sum())
    folium.CircleMarker([lat, lon],
                        radius=15,
                        popup=str(num) + ' Listings Associated with this Cluster',
                        fill_color="#3db7e4",
                        ).add_to(cluster_map)
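
To make the attributes used above concrete, here is a toy sketch of what `cluster_centers_` and `labels_` hold after fitting. The six points are invented; `n_init=10` is set explicitly only to keep behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of (latitude, longitude)-like points; values are invented.
points = np.array([
    [40.0, -74.0], [40.1, -74.1], [39.9, -73.9],
    [34.0, -118.0], [34.1, -118.1], [33.9, -117.9],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(points)

# One row per cluster: the mean position of its members.
print(toy_kmeans.cluster_centers_)

# labels_ assigns each input row to a cluster; bincount tallies members per cluster.
counts = np.bincount(toy_kmeans.labels_, minlength=2)
print(counts)
```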

## Visualizing the cluster map

In [None]:
cluster_map

In [None]:
# Numeric features distribution analysis
numeric_features = dataset.select_dtypes(include=['int64','float64']).columns
nominal_features = dataset.select_dtypes(include=['object'])
numeric_features = numeric_features.delete(0)  # drop the first numeric column
fig, axes = plt.subplots(nrows=2, ncols=4)
fig.set_figheight(15)
fig.set_figwidth(25)
# One KDE plot per subplot; zip stops after the eight available axes.
for ax, feature in zip(axes.flat, numeric_features):
    dataset[feature].plot(kind='kde', ax=ax)
    ax.set_title(feature + ' Distribution', fontsize=16, fontweight='bold')

### Removing Outliers

In [None]:
lower_bound = .25
upper_bound = .75
# Keep rows whose price lies between the 25th and 75th percentiles (bounds are inclusive by default).
iqr = dataset[dataset['price'].between(dataset['price'].quantile(lower_bound), dataset['price'].quantile(upper_bound))]
iqr = iqr[iqr['number_of_reviews'] > 0]
iqr = iqr[iqr['calculated_host_listings_count'] < 10]
iqr = iqr[iqr['number_of_reviews'] < 200]
iqr = iqr[iqr['minimum_nights'] < 10]
iqr = iqr[iqr['reviews_per_month'] < 5]
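
Note that `between` with the 0.25 and 0.75 quantiles keeps only the middle half of the prices, which is stricter than the common 1.5 × IQR rule. The sketch below contrasts the two on invented numbers; the 1.5 × IQR variant is shown as an alternative for comparison, not as what this notebook does.

```python
import pandas as pd

prices = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 1000])  # invented values

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)

# Approach used in the cell above: keep values between Q1 and Q3 (inclusive).
middle_half = prices[prices.between(q1, q3)]

# Alternative (Tukey's rule): keep values within 1.5 * IQR of the quartiles.
iqr_width = q3 - q1
tukey_kept = prices[prices.between(q1 - 1.5 * iqr_width, q3 + 1.5 * iqr_width)]

print(middle_half.tolist())
print(tukey_kept.tolist())
```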

In [None]:
drop_list = ['name','neighbourhood_group','host_id','host_name','last_review']
# Pass the column names directly rather than a DataFrame slice.
dataset.drop(columns=drop_list, inplace=True)

In [None]:
plt.rcParams['figure.figsize'] = (20,10)
plt.style.use('dark_background')

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
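
The encoder created above is not applied in the cells shown here. As a hedged sketch of how it could be used, the snippet below encodes an invented `room_type` column; applying it to this particular column is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

toy_encoder = LabelEncoder()
# fit_transform learns the sorted unique labels and maps each to an integer.
toy["room_type_code"] = toy_encoder.fit_transform(toy["room_type"])

print(list(toy_encoder.classes_))
print(toy["room_type_code"].tolist())  # [1, 0, 1, 2]
```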

In [None]:
usecols = ['id','name','latitude','longitude','room_type','price','minimum_nights','number_of_reviews',
           'last_review','reviews_per_month','calculated_host_listings_count','availability_365','city']
a_data = pd.read_csv('/kaggle/input/us-airbnb-open-data/AB_US_2020.csv', usecols=usecols)

In [None]:
numeric_features = a_data.select_dtypes(include=['int64','float64']).columns
nominal_features = a_data.select_dtypes(include=['object'])
numeric_features=numeric_features.delete(0)

In [None]:
a_data.info()
a_data.head(5)

In [None]:
# Percentage of missing values per column
missing = a_data.isna().sum()
missing /= a_data.shape[0]
missing *= 100
missing = missing.to_frame().rename(columns={0:'Percent of Missing Values'})
missing
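
The same percentages can be computed in one step, because the mean of a boolean mask is the fraction of `True` values. A sketch on an invented frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "reviews_per_month": [0.5, np.nan, 1.2, 3.0],
    "last_review": ["2020-01-01", None, None, "2020-03-05"],
})

# isna() gives a boolean frame; mean() of booleans is the share of missing cells.
missing_pct = (toy.isna().mean() * 100).to_frame("Percent of Missing Values")

print(missing_pct)
```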

In [None]:
ax = sns.heatmap(a_data.isna().T)
ax.set_title('Missing Values Proportion',fontsize=19,fontweight='bold')

In [None]:
a_data = a_data.dropna()

In [None]:
# Numeric feature distribution
fig, axes = plt.subplots(nrows=2, ncols=4)
fig.set_figheight(17)
fig.set_figwidth(25)
# One KDE plot per subplot, pairing each axis with one numeric column.
for ax, feature in zip(axes.flat, numeric_features):
    a_data[feature].plot(kind='kde', ax=ax)
    ax.set_title(feature + ' Distribution', fontsize=16, fontweight='bold')

In [None]:
# Removing outliers: keep rows whose price lies between the 25th and 75th percentiles
lower_bound = .25
upper_bound = .75
iqr = a_data[a_data['price'].between(a_data['price'].quantile(lower_bound), a_data['price'].quantile(upper_bound))]
iqr = iqr[iqr['number_of_reviews'] > 0]
iqr = iqr[iqr['calculated_host_listings_count'] < 10]
iqr = iqr[iqr['number_of_reviews'] < 200]
iqr = iqr[iqr['minimum_nights'] < 10]
iqr = iqr[iqr['reviews_per_month'] < 5]

In [None]:
# Distributions after removing outliers
fig, axes = plt.subplots(nrows=2, ncols=4)
fig.set_figheight(17)
fig.set_figwidth(25)
for ax, feature in zip(axes.flat, numeric_features):
    iqr[feature].plot(kind='kde', ax=ax)
    if feature not in ['latitude', 'longitude']:
        # Start the x-axis at 0 and leave 25% headroom above the maximum.
        ax.set_xlim(0, iqr[feature].max() * 1.25)
    ax.set_title(feature + ' Distribution', fontsize=16, fontweight='bold')
