Finding Missing Values

https://medium.com/analytics-vidhya/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

Handling Missing Values in a Data Frame


https://medium.com/analytics-vidhya/python-handling-missing-values-in-a-data-frame-4156dac4399

## Seattle Airbnb Open Data

https://www.kaggle.com/datasets/airbnb/seattle?resource=download&select=reviews.csv

#### Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in Seattle, WA.

#### Content
The following Airbnb activity is included in this Seattle dataset:

Listings, including full descriptions and average review score

Reviews, including unique id for each reviewer and detailed comments

Calendar, including listing id and the price and availability for that day

#### Inspiration
- Can you describe the vibe of each Seattle neighborhood using listing descriptions?
- What are the busiest times of the year to visit Seattle? By how much do prices spike?
- Is there a general upward trend of both new Airbnb listings and total Airbnb visitors to Seattle?

http://insideairbnb.com/get-the-data/

### Finding Missing Values in a Pandas Data Frame

In [3]:
# Importing the packages
import pandas as pd
import numpy as np

Step 1: Load the data frame and study the structure of the data frame.

In [4]:
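# Assumes listings.csv from the Kaggle dataset above is in the working directory
df_listing = pd.read_csv("listings.csv")
display(df_listing.describe())  # summary statistics of the numerical columns
display(df_listing.head())  # first five rows
display(df_listing.dtypes.value_counts())  # number of columns per data type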

FileNotFoundError: [Errno 2] No such file or directory: 'listings.csv'

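Step 2: Separate categorical and numerical columns in the data frame
- In the output of df_listing.dtypes == 'object', False marks a numerical column and True marks a categorical (object) column.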

In [None]:
df_listing.dtypes

In [None]:
df_listing.dtypes == 'object'

In [None]:
# Split the columns of the original data frame into two groups and assign each group to a new variable
numerical_value = df_listing.columns[df_listing.dtypes != 'object']
categorical_value = df_listing.columns[df_listing.dtypes == 'object']

print(numerical_value)
print(categorical_value)
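
The same split can also be sketched with select_dtypes, shown here on a small made-up frame rather than on listings.csv. The frame toy and its column names are assumptions for illustration only. Note one difference from the comparison against 'object' above: include="number" does not count boolean columns as numerical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings data
toy = pd.DataFrame({
    "price": [85.0, np.nan, 120.0],
    "beds": [1, 2, 3],
    "host_name": ["Ann", None, "Raj"],
})

# Columns with a numeric dtype (int, float)
numerical_cols = toy.select_dtypes(include="number").columns
# Everything else (text columns in this toy frame)
categorical_cols = toy.select_dtypes(exclude="number").columns

print(list(numerical_cols))
print(list(categorical_cols))
```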

Step 3: Find the missing values

In [None]:
# Show only the columns listed in numerical_value, i.e. every column whose data type is not object
df_listing[numerical_value]

- True = missing values
- False = does not have missing values

In [None]:
# isnull() to find out all the fields which have the missing values
df_listing[numerical_value].isnull()

In [None]:
# Sum/count of all missing values in each column
df_listing[numerical_value].isnull().sum()

In [None]:
# sorting out the columns in descending order to have a better picture
df_listing[numerical_value].isnull().sum().sort_values(ascending=False)

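To get the share of missing values in each column, divide the counts by the length of the data frame. The result is a fraction between 0 and 1; multiply by 100 to read it as a percentage.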


In [None]:
# gives you the number of rows in the data frame
len(df_listing)

As the output below shows, the license column is missing 100% of its data and the square_feet column is missing 97%.

In [None]:
df_listing[numerical_value].isnull().sum().sort_values(ascending=False)/len(df_listing)
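
As a minimal sketch of the same calculation, the mean of the boolean isnull() mask gives the missing fraction directly, and multiplying by 100 turns it into a percentage. The toy frame below is made up for illustration; its values are not taken from listings.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "beds": [1.0, 2.0, 1.0, 3.0],
})

# isnull() gives True/False; the mean of booleans is the fraction of True values
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Same result as the count divided by the number of rows, times 100
same_pct = toy.isnull().sum().sort_values(ascending=False) / len(toy) * 100

print(missing_pct)
```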

Conclusion
1. Use the isnull() function to identify the missing values in the data frame
2. Use the sum() function to count the missing values in each column
3. Use the sort_values(ascending=False) function to list the columns in descending order of missing values
4. Divide by len(df) to get the fraction of missing values in each column (multiply by 100 for a percentage)


### Handling Missing Values in a Data Frame

1. Deleting all rows/columns with missing data: This can be used when the majority of the data in a row or column is missing. Deleting rows/columns can discard valuable information and lead to biased models, so analyze your data before deleting and check whether there is a particular reason the data is missing.

2. Imputing data: This is by far the most common way to handle missing data. In this method you fill in a substitute value wherever data is missing. Imputing data can also introduce bias into the dataset. Imputation can be done in multiple ways.


In [None]:
# % of missing data on each numerical column
df_listing[numerical_value].isnull().sum().sort_values(ascending=False)/len(df_listing)

- Among the numerical columns, 100% of the values in license and 97% of the values in square_feet are missing.

In [None]:
# % of missing data on each categorical column
df_listing[categorical_value].isnull().sum().sort_values(ascending=False)/len(df_listing)

- Among the categorical columns, 60% of the values in monthly_price, 51% of the values in security_deposit and 47% of the values in weekly_price are missing.

### 1. Deleting rows/columns with missing data:

axis=1 refers to columns; axis=0 refers to rows.

In [None]:
# Using the drop method; columns=[...] already targets columns, so axis=1 is not needed
df = df_listing.drop(columns=["license", "square_feet", "monthly_price", "security_deposit", "weekly_price"])
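
Instead of typing the column names by hand, the columns to drop can be picked by their missing fraction. The following is only a sketch on a made-up frame; the 0.5 cutoff and the names toy, threshold and trimmed are assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "license": [np.nan] * 4,
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "beds": [1.0, np.nan, 1.0, 3.0],
})

threshold = 0.5  # assumed cutoff: drop columns with more than 50% missing

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Names of the columns above the cutoff
cols_to_drop = missing_fraction[missing_fraction > threshold].index

trimmed = toy.drop(columns=cols_to_drop)
print(list(trimmed.columns))
```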

In [None]:
# Split the columns of the new data frame [df] into two groups and assign each group to a new variable

numerical_value_1 = df.columns[df.dtypes != 'object']
categorical_value_1 = df.columns[df.dtypes == 'object']

In [None]:
# % of missing data on each numerical column for the new df data after dropping
df[numerical_value_1].isnull().sum().sort_values(ascending=False)/len(df)


In [None]:
# % of missing data on each categorical column for the new df data after dropping
df[categorical_value_1].isnull().sum().sort_values(ascending=False)/len(df)


#### Deleting rows/columns with NA
- how='any': drops a row or column if at least one of its values is NA.
- how='all': drops a row or column only if all of its values are NA.
- axis=0 drops rows with NA values; axis=1 drops columns with NA values.

In [None]:
# Delete every row that has NA in the host_name column
df = df.dropna(subset=['host_name'],how='any',axis=0)
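
To make the how and axis options concrete, here is a small sketch on a made-up frame. The frame toy and its values are assumptions chosen so that each option gives a different result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 are entirely missing,
# and the notes column is entirely missing
toy = pd.DataFrame({
    "host_name": ["Ann", None, "Raj", None],
    "beds": [1.0, np.nan, np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows with at least one NA (every row has NA in notes)
rows_any = toy.dropna(how="any", axis=0)

# Drop rows only when all of their values are NA
rows_all = toy.dropna(how="all", axis=0)

# Drop columns only when all of their values are NA
cols_all = toy.dropna(how="all", axis=1)

# Drop rows with NA in one specific column
rows_subset = toy.dropna(subset=["host_name"])

print(len(rows_any), list(rows_all.index), list(cols_all.columns))
```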

### 2. Imputing Data
- Imputing the mean, median or mode of the column in place of the missing values.

Filling categorical data

In [None]:
df_listing[categorical_value].isnull().sum()

In [None]:
missing = df_listing[categorical_value].fillna("Missing_Data")
missing.isnull().sum()

Filling numerical data

In [None]:
df_listing[numerical_value].isnull().sum().sort_values(ascending=False)

In [None]:
# fillna needs values, not a function: pass the per-column means
fill_mean = df_listing[numerical_value].fillna(df_listing[numerical_value].mean())
fill_mean.isnull().sum()

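Alternatively, apply lambda x: x.fillna(x.mean()) with axis=0 to fill each column with its own mean. A column with no observed values, such as license, has no mean and therefore stays missing.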

In [None]:
fill_mean = df_listing[numerical_value].apply(lambda x: x.fillna(x.mean()),axis=0)

In [None]:
fill_mean.isnull().sum()

In [None]:
# Function to fill missing values with the column mean (numerical columns only)
fill_mean = lambda x: x.fillna(x.mean())

# Apply the function to fill the missing values in place
df_listing[numerical_value] = df_listing[numerical_value].apply(fill_mean)

In [None]:
df_listing.isnull().sum()
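
The cells above impute the mean only. Median and mode imputation, also mentioned at the start of this section, could look roughly like the sketch below. The frame toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "price": [80.0, np.nan, 100.0, 300.0],
    "room_type": ["Private room", "Entire home", None, "Entire home"],
})

# The median is less affected by extreme values than the mean
price_median = toy["price"].fillna(toy["price"].median())

# mode() returns a Series because ties are possible; take its first entry
room_mode = toy["room_type"].fillna(toy["room_type"].mode().iloc[0])

print(price_median.tolist())
print(room_mode.tolist())
```

The median suits skewed numerical columns such as prices, while the mode is the usual choice for categorical columns, where a mean or median is not defined.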