
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Documentation

NumPy: provides a powerful N-dimensional array object, broadcasting functions, tools for integrating C/C++ and Fortran code, and linear algebra, Fourier transform, and random number capabilities.

Pandas: a fast, flexible, open-source data analysis and manipulation tool built on top of Python.

Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python.

Seaborn: a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing statistical graphics.

In [None]:
df = pd.read_csv('/content/AB_NYC_2019.csv')

## Documentation
pd.read_csv: reads a comma-separated values file into a DataFrame. It accepts many optional parameters that control how the data is read.

'/content/AB_NYC_2019.csv': the path to the file. Because no other arguments are passed, read_csv relies on its defaults: a comma (',') as the delimiter and the first row (row 0) as the header for column names.
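A few of the optional parameters mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's own loading step: the semicolon-separated text is a hypothetical three-row sample built in memory (its values are borrowed from the preview rows), so the cell runs without the CSV file.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for a file on disk.
csv_text = """id;name;price;last_review
2539;Clean & quiet apt home by the park;149;2018-10-19
2595;Skylit Midtown Castle;225;2019-05-21
3647;THE VILLAGE OF HARLEM....NEW YORK !;150;
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                                 # non-default delimiter
    header=0,                                # first row holds the column names (the default)
    usecols=["id", "price", "last_review"],  # load only these columns
    parse_dates=["last_review"],             # parse this column as dates
)
print(sample.dtypes)
```

The empty last_review field in the third row becomes a missing value (NaT) in the parsed date column.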



In [None]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


## Documentation
df.head(): returns the first five rows of the DataFrame by default, giving a quick preview of the data. It is especially helpful after loading data or applying transformations, to check that everything looks as expected.
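As a small sketch of how the preview size can be controlled, the example below uses a hypothetical ten-row frame named toy rather than the listing data. head() accepts an optional row count, and tail() is the matching method for the end of the frame.

```python
import pandas as pd

# Hypothetical ten-row frame used only for illustration.
toy = pd.DataFrame({"id": range(10)})

first_five = toy.head()    # default: first 5 rows
first_three = toy.head(3)  # explicit row count
last_two = toy.tail(2)     # last 2 rows

print(len(first_five), len(first_three), len(last_two))
```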

In [None]:
df.isnull().sum()

column,missing_values
id,0
name,16
host_id,0
host_name,21
neighbourhood_group,0
neighbourhood,0
latitude,0
longitude,0
room_type,0
price,0


## Documentation
df.isnull(): returns a DataFrame of the same shape as df in which each element is True if the original value was missing (NaN or None) and False otherwise.

.sum(): chained to df.isnull(), it adds up the True values in each column. Since True is treated as 1 and False as 0, this counts the number of missing values per column.
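The two steps can be seen separately on a tiny frame. This is an illustrative sketch: toy and its values are made up, and only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with a few missing entries.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "price": [149, 225, 150],
    "reviews_per_month": [0.21, np.nan, np.nan],
})

mask = toy.isnull()              # Boolean frame, same shape as toy
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values

print(mask)
print(missing_per_column)
```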

In [None]:
df.duplicated().sum()

0

## Documentation
df.duplicated(): returns a Boolean Series in which each element indicates whether the corresponding row is a duplicate of an earlier row (True) or not (False). Chaining .sum() counts the True values, so the result 0 means the dataset contains no fully duplicated rows.
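By default duplicated() compares entire rows and flags only the later copies. The sketch below, on a hypothetical four-row frame, shows how the subset and keep parameters change what is flagged.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical; host 1 appears three times.
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 1],
    "room_type": ["Private room", "Private room", "Shared room", "Entire home/apt"],
})

full_dupes = toy.duplicated()                    # compare whole rows
host_dupes = toy.duplicated(subset=["host_id"])  # compare only host_id
all_copies = toy.duplicated(keep=False)          # flag every copy, including the first

print(full_dupes.sum(), host_dupes.sum(), all_copies.sum())
```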

In [None]:
df.dtypes

column,dtype
id,int64
name,object
host_id,int64
host_name,object
neighbourhood_group,object
neighbourhood,object
latitude,float64
longitude,float64
room_type,object
price,int64


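## Documentation
df.dtypes returns the data type of each column in the DataFrame. This matters because pandas uses data types to optimize storage and to decide which operations are valid on a column. In the output above, int64 and float64 are numeric types, while object typically holds strings.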
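When a column's type is not the one the analysis needs, it can be converted. The following is a sketch under assumptions: toy is a hypothetical frame, and converting last_review to dates and room_type to a categorical is one reasonable choice, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical frame with text, date-like text, and integer columns.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "last_review": ["2018-10-19", "2019-05-21", None],
    "price": [149, 225, 150],
})
print(toy.dtypes)  # types as inferred on construction

toy["last_review"] = pd.to_datetime(toy["last_review"])  # text -> datetime
toy["room_type"] = toy["room_type"].astype("category")   # repeated labels -> category
print(toy.dtypes)  # types after conversion
```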

In [None]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

## Documentation
df.columns returns an Index object containing the names of the columns in the DataFrame. It is a quick way to confirm which columns are available before referring to them by name.
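A short sketch of two common uses of the column Index: converting it to a plain list and testing whether a column exists. The frame toy is hypothetical and empty; only its column names matter here.

```python
import pandas as pd

# Hypothetical empty frame reusing three column names from the listing data.
toy = pd.DataFrame(columns=["id", "price", "room_type"])

column_names = toy.columns.tolist()                 # plain Python list of names
has_price = "price" in toy.columns                  # membership test on the Index
has_policy = "cancellation_policy" in toy.columns   # False: no such column

print(type(toy.columns).__name__, column_names, has_price, has_policy)
```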

### 1. Average Price of Listings:

In [None]:
average_price = df['price'].mean()
print(f"The average price of the listings is: {average_price}")

The average price of the listings is: 152.7206871868289


## Documentation

df['price']: selects a single column from the DataFrame by name. It returns a pandas Series containing the values from that column.

.mean(): A pandas series method that calculates the average of the values in the series. It handles missing values (NaN) by excluding them from the calculation.
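The handling of missing values can be checked on a three-element Series. The values are made up for illustration; skipna is the parameter that controls whether NaN entries are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing entry.
prices = pd.Series([100.0, 200.0, np.nan])

mean_default = prices.mean()             # NaN is skipped: (100 + 200) / 2 = 150.0
mean_strict = prices.mean(skipna=False)  # NaN propagates, so the result is NaN

print(mean_default, mean_strict)
```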


### 2. Distribution of Room Types:

In [None]:
room_type_distribution = df['room_type'].value_counts(normalize=True) * 100
print("Distribution of Room Types (in %):")
print(room_type_distribution)

Distribution of Room Types (in %):
room_type
Entire home/apt    51.966459
Private room       45.661111
Shared room         2.372431
Name: proportion, dtype: float64


## Documentation
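df['room_type'].value_counts(normalize=True): returns the share of rows for each room type as a fraction of the total; multiplying by 100 converts the fractions to percentages.

Entire home/apt: approximately 51.97% of the listings are entire homes or apartments.
Private room: about 45.66% of the listings are private rooms.
Shared room: around 2.37% of the listings are shared rooms.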
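To make the difference between raw counts and percentages concrete, here is a minimal sketch on a hypothetical four-row Series. The labels match the dataset's room types, but the proportions are invented.

```python
import pandas as pd

# Hypothetical sample of four listings.
rooms = pd.Series(["Entire home/apt", "Private room", "Entire home/apt", "Shared room"])

counts = rooms.value_counts()                      # absolute counts per label
shares = rooms.value_counts(normalize=True) * 100  # fractions scaled to percent

print(counts)
print(shares)
```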

### 3. Neighbourhood with Most Listings:

In [None]:
most_listings_neighbourhood = df['neighbourhood_group'].value_counts().idxmax()
print(f"The neighbourhood with the most listings is: {most_listings_neighbourhood}")

The neighbourhood with the most listings is: Manhattan


## Documentation

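value_counts() counts the listings in each neighbourhood_group, and idxmax() returns the label with the highest count. Note that the column used is neighbourhood_group, so the result, Manhattan, is a neighbourhood group (a borough) rather than an individual neighbourhood: among the neighbourhood groups in the dataset, Manhattan has the highest number of Airbnb listings.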
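The count-then-idxmax pattern can be sketched on a small hypothetical Series. The labels are real neighbourhood group names, but the counts are invented for illustration.

```python
import pandas as pd

# Hypothetical sample of six listings.
groups = pd.Series(["Manhattan", "Brooklyn", "Manhattan", "Queens", "Manhattan", "Brooklyn"])

counts = groups.value_counts()  # sorted from most to least frequent
top_group = counts.idxmax()     # index label of the largest count

print(counts)
print(top_group)
```

Because value_counts() sorts in descending order by default, counts.index[0] gives the same label here.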

### 4. Correlation between Reviews and Price

In [None]:
correlation = df['number_of_reviews'].corr(df['price'])
print(f"The correlation between number of reviews and price is: {correlation}")

The correlation between number of reviews and price is: -0.047954226582662185


## Documentation
The .corr() method of a pandas Series calculates the correlation between that Series and another one. By default it uses the Pearson correlation coefficient, but other methods such as 'kendall' or 'spearman' can be selected with the method argument.

The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 to +1:

+1: Perfect positive linear correlation (as one variable increases, the other increases proportionally).
0: No linear correlation (the variables may still be related in a non-linear way).
-1: Perfect negative linear correlation (as one variable increases, the other decreases proportionally).

Here the coefficient is about -0.048, which is very close to 0: in this dataset there is almost no linear relationship between a listing's number of reviews and its price.
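The following sketch illustrates what the Pearson coefficient does and does not capture, using small made-up Series. The rank-based line relies on the fact that Spearman's coefficient is the Pearson coefficient computed on ranks; it is written with rank() so the example needs nothing beyond pandas.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
linear = 2 * x + 1  # exact linear function of x
cubic = x ** 3      # increasing, but not linear

pearson_linear = x.corr(linear)           # perfect linear relationship
pearson_cubic = x.corr(cubic)             # strong but below 1
rank_cubic = x.rank().corr(cubic.rank())  # Pearson on ranks (Spearman)

# A clear dependence (y = x squared) that Pearson does not detect.
centered = pd.Series([-2, -1, 0, 1, 2])
pearson_parabola = centered.corr(centered ** 2)

print(pearson_linear, pearson_cubic, rank_cubic, pearson_parabola)
```

The last value is zero up to floating-point error even though the second variable is fully determined by the first, which is why a coefficient near 0 rules out only a linear relationship.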

### 5. Distribution of Host Listing Counts:

In [None]:
# The dataset has no cancellation-policy column (see df.columns above);
# calculated_host_listings_count holds the number of listings each host has.
host_listings_distribution = df['calculated_host_listings_count'].value_counts(normalize=True) * 100
print("Distribution of Host Listing Counts (in %):")
print(host_listings_distribution)

Distribution of Host Listing Counts (in %):
calculated_host_listings_count
1      66.066060
2      13.616934
3       5.834952
4       2.945086
5       1.728193
6       1.165763
8       0.850803
7       0.816034
327     0.668780
9       0.478577
232     0.474486
10      0.429492
96      0.392678
12      0.368136
13      0.265876
121     0.247469
11      0.224972
52      0.212701
103     0.210655
33      0.202475
49      0.200429
91      0.186113
87      0.177932
15      0.153390
14      0.143164
23      0.141119
34      0.139074
17      0.139074
65      0.132938
31      0.126802
28      0.114531
18      0.110441
25      0.102260
50      0.102260
47      0.096124
43      0.087944
20      0.081808
39      0.079763
37      0.075672
32      0.065446
30      0.061356
29      0.059311
27      0.055220
26      0.053175
21      0.042949
19      0.038859
16      0.032723
Name: proportion, dtype: float64


In [None]:
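# Rename 'cancel_policy' to 'cancellation_policy' if that column exists.
# AB_NYC_2019.csv has no such column (see df.columns above), so with the
# default errors='ignore' this call leaves df unchanged.
df = df.rename(columns={'cancel_policy': 'cancellation_policy'})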

## Documentation

The rename() method in pandas can rename columns, index labels, or both. It is typically used to make names clearer and more consistent.

columns argument: a dictionary (or a function) that maps old column names to new column names. Names that are not found in the DataFrame are ignored by default.
inplace argument: if set to True, the renaming is done directly on the original DataFrame and the method returns None. If False (the default), a new DataFrame with the renamed columns is returned, which is why the cell above assigns the result back to df.
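The behaviour described above can be sketched on a hypothetical frame that does contain a cancel_policy column; the listing data does not, so the column and its values are invented for illustration. The errors parameter shown at the end is an optional safeguard, not something the notebook uses.

```python
import pandas as pd

# Hypothetical frame with a 'cancel_policy' column.
toy = pd.DataFrame({"cancel_policy": ["strict", "flexible"], "price": [149, 225]})

# Default (inplace=False): a new frame is returned and toy is untouched.
renamed = toy.rename(columns={"cancel_policy": "cancellation_policy"})

# A name that does not exist is silently ignored by default.
unchanged = toy.rename(columns={"no_such_column": "x"})

# errors="raise" turns a missing name into a KeyError instead.
raised = False
try:
    toy.rename(columns={"no_such_column": "x"}, errors="raise")
except KeyError:
    raised = True

print(list(renamed.columns), list(unchanged.columns), raised)
```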