<a href="https://colab.research.google.com/github/rosh4github/eportfolio/blob/main/ML_Unit_2_EDA_Analysis_E_Portfolio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exploratory Data Analysis (Python)

### Importing the required libraries for EDA

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns               # Visualization
import matplotlib.pyplot as plt     # Visualization
%matplotlib inline
sns.set(color_codes=True)

### Loading the data into the data frame

In [None]:
df = pd.read_csv("auto-mpg.csv")
# To display the top 5 rows
df.head(5)

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [None]:
# To display the bottom 5 rows
df.tail(5)

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
393,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
395,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
396,28.0,4,120.0,79,2625,18.6,82,1,ford ranger
397,31.0,4,119.0,82,2720,19.4,82,1,chevy s-10


### Checking the types of data

In [None]:
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object

In [None]:
# Clean the column: '?' marks a missing horsepower value
df['horsepower'] = df['horsepower'].replace('?', np.nan)

# Convert the remaining strings to numbers
df['horsepower'] = pd.to_numeric(df['horsepower'])
df.dtypes


mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object
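
The clean-up above can also be sketched on a small stand-in series. The values below are made up for illustration (they are not rows of `auto-mpg.csv`). `errors='coerce'` is an option of `pd.to_numeric` that turns unparseable entries into `NaN`, which also makes it easy to list the tokens that blocked the conversion.

```python
import pandas as pd

# Hypothetical stand-in for an object-typed column containing a '?' placeholder
raw = pd.Series(['130', '165', '?', '150'], name='horsepower')

# Coerce to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(raw, errors='coerce')

# Tokens that failed to parse: present in the raw data, NaN after coercion
bad_tokens = raw[numeric.isna() & raw.notna()].unique().tolist()

print(bad_tokens)       # tokens that blocked the conversion
print(numeric.dtype)    # a float dtype, because NaN forces a float column
```

Listing the failed tokens before overwriting the column is a cheap check that only the expected placeholder is being treated as missing.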

### Dropping irrelevant columns

In [None]:
# Drop 'car name': a free-text label that the numeric analysis does not use
df = df.drop(columns=['car name'])
df.head(5)

### Renaming the columns
For ease of understanding

In [None]:
df = df.rename(columns={"model year" : "model_year", "car name" : "car_name"})
df.head(5)

### Checking for duplicate rows
If any

In [None]:
df.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])
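
As a minimal sketch of how the duplicate check behaves, the toy frame below (hypothetical values, not taken from the dataset) contains one repeated row. `duplicated()` flags only the later copy, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 15.0, 16.0],
    'cylinders': [8, 8, 8, 8],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print("number of duplicate rows:", n_duplicates)
print("shape after dropping duplicates:", deduped.shape)
```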

In [None]:
# counts the non-null values in each column
df.count()

### Checking for missing or null values
and dropping if applicable

In [None]:
print(df.isnull().sum())

In [None]:
# Drop the missing values
df = df.dropna()
df.count()

In [None]:
# confirmation of no null values
print(df.isnull().sum())
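
The effect of `dropna()` can be seen on a small example. The frame below is made up for illustration. The median fill at the end is an alternative that this notebook does not use; it is shown only for comparison, for cases where losing rows is not acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing horsepower value
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 25.0, 30.0],
    'horsepower': [130.0, np.nan, 90.0, 70.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Approach used in this notebook: drop every row that contains a NaN
dropped = toy.dropna()

# Alternative (not used here): fill the gap with the column median
filled = toy.fillna({'horsepower': toy['horsepower'].median()})

print(missing_per_column)
print(len(toy), "rows before,", len(dropped), "rows after dropna()")
```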

### Detecting Outliers

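Outliers are detected with the interquartile range (IQR) rule: a value is flagged if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1.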

In [None]:
sns.boxplot(x=df['mpg'])

In [None]:
sns.boxplot(x=df['horsepower'])

In [None]:
# method to select only the numeric columns before calculating the quantiles
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
numeric_df = numeric_df[~((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).any(axis=1)]
numeric_df.shape

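The number of rows removed as outliers is the row count before filtering minus the row count after it; for the 398-row auto-mpg data, both counts are at most 398.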
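
A worked example may make the IQR fences concrete. The series below is made up for illustration, and the variable names are hypothetical; the quantiles use pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100)
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.75
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

# Split the series into outliers and kept values
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
kept = values[~is_outlier]

print("fences:", lower, upper)
print("outliers:", outliers.tolist())
```

Note that the notebook's filter applies this test to every numeric column at once and drops a row if any single column falls outside its fences, so it can remove considerably more rows than a one-column check.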

### Plotting features against one another (scatter) and against frequency (histogram)

#### Histogram

In [None]:
df['cylinders'].value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars per cylinder count")
plt.ylabel('Number of cars')
plt.xlabel('Cylinders')
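
The bar chart above counts the occurrences of each distinct value, which suits discrete columns. For a continuous column, a histogram first groups the values into bins. The sketch below shows that binning step with `np.histogram`; the values, the bin count and the range are made up for illustration.

```python
import numpy as np

# Hypothetical continuous measurements
values = np.array([9.0, 14.0, 15.0, 18.0, 22.0, 24.0, 27.0, 31.0, 36.0, 44.0])

# Four equal-width bins over 0..48; each bin includes its left edge,
# and the last bin also includes its right edge
counts, edges = np.histogram(values, bins=4, range=(0, 48))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {n}")
```

A pandas column can be plotted the same way with `Series.plot(kind='hist', bins=4)`, which performs the binning and draws the bars in one call.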

#### Heat Maps

In [None]:
plt.figure(figsize=(10,5))
c=numeric_df.corr()
sns.heatmap(c,cmap="BrBG", annot=True)
c

If the heat map shows no strong correlation, the non-numeric columns could be encoded numerically to check whether they reveal further relationships.
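
One way to follow up on that idea is to one-hot encode a text column before computing the correlation matrix. The sketch below is a minimal illustration under stated assumptions: the frame and the `maker` column are hypothetical, and `pd.get_dummies` is used in place of integer codes because the correlation with arbitrary integer labels of a nominal category is not meaningful.

```python
import pandas as pd

# Hypothetical toy frame with one text column
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 30.0, 32.0, 27.0, 16.0],
    'maker': ['ford', 'ford', 'vw', 'vw', 'vw', 'ford'],
})

# One indicator column per category (1.0 where the row belongs to it)
encoded = pd.get_dummies(toy, columns=['maker'], dtype=float)

# Correlation matrix now includes the indicator columns
corr = encoded.corr()

print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap` in the same way as the numeric-only matrix.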

#### Scatterplot

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['weight'], df['mpg'])
ax.set_xlabel('Weight')
ax.set_ylabel('MPG')
plt.show()

##### Scatter plot - for cars of origin 1 - horsepower vs mpg

In [None]:
# boolean mask for origin 1
origin1_mask = df['origin'] == 1

# dataframe filter to only include origin 1
origin1_df = df[origin1_mask]

# scatter plot for origin 1 only
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(origin1_df['horsepower'], origin1_df['mpg'])
ax.set_xlabel('Horsepower')
ax.set_ylabel('MPG')
plt.show()

### Random analysis

## References

Google Colab (n.d.) Exploratory Data Analysis with Python. Available from: https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb#scrollTo=TEfC0QszTKX_ [Accessed 29 May 2024].

Kaggle (n.d.) Auto-mpg dataset. Available from: https://www.kaggle.com/datasets/uciml/autompg-dataset?resource=download [Accessed 29 May 2024].