# Data Preprocessing - Removing Null Value Rows

When a DataFrame is created from a CSV file, blank cells are imported as null (NaN) values, which can cause problems in later operations on that DataFrame. The pandas isnull() and notnull() methods are used to detect and manage null values in a DataFrame.
 
#### DataFrame.isnull()
- **Syntax:** pd.isnull(obj) or DataFrame.isnull()
- **Parameters:** The object to check for null values (only needed for the pd.isnull(obj) form)
- **Return Type:** DataFrame of Boolean values that are True wherever a value is NaN
 
If we chain sum() after isnull(), we get the number of null values in each column of our dataset.

Let us see with the help of an example.

In [1]:
import pandas as pd

data = pd.read_csv("googleplaystore.csv")

print(data.isnull().sum())

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64


The columns are App, Category, Rating and so on. The number shown next to each column name is the total number of null values in that column.
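To make the counting step concrete, here is a minimal sketch on a tiny hand-made frame. The frame `toy` and its values are made up for illustration and are not taken from googleplaystore.csv. It shows that isnull() marks missing cells, that sum() counts them per column, and that notnull() gives the complementary count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", None, "Paid"],
})

null_mask = toy.isnull()               # Boolean frame: True where a value is missing
null_counts = null_mask.sum()          # number of missing values in each column
present_counts = toy.notnull().sum()   # number of present values in each column

print(null_counts)
print(present_counts)
```

Because True counts as 1 when summed, the per-column totals are simply the number of True cells in the Boolean frame.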

We can delete all the rows that contain at least one null value with the dropna() function. Let us see how that works.

In [2]:
df = data.dropna()

print(df.isnull().sum())

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64


So no more null values are present in any of the columns.
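By default dropna() removes every row that has a null in any column. The sketch below, again on made-up data, shows two optional parameters of dropna() that narrow this behaviour: `subset` restricts the check to chosen columns, and `how="all"` drops a row only when all checked columns are null.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, None],
})

all_complete = toy.dropna()                      # keep rows with no nulls at all
rating_known = toy.dropna(subset=["Rating"])     # only require Rating to be present
some_info = toy.dropna(how="all", subset=["Rating", "Type"])  # drop rows missing both

print(len(all_complete), len(rating_known), len(some_info))
```

Which variant is appropriate depends on which columns the later analysis actually uses.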

# Data Analysis - Numeric

We have already seen how to remove null values from our dataset. Let's analyse the numeric data and see what we can infer from it.

So let us see our dataset again before getting started.

In [3]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Let's do some analysis on the Rating column

In [4]:
print(df['Rating'])

0        4.1
1        3.9
2        4.7
3        4.5
4        4.3
        ... 
10834    4.0
10836    4.5
10837    5.0
10839    4.5
10840    4.5
Name: Rating, Length: 9360, dtype: float64


So the data type of the Rating column is float64.

### Finding out the Average Rating

In [5]:
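total = 0.0

# Add up every rating; the sum is kept as a float so the fractional part is not lost
for rating in df['Rating']:
    total += rating

average = total / len(df['Rating'])

print("The average rating of the apps in our dataset is:", round(average, 3))

The average rating of the apps in our dataset is: 4.192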
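The same average can be obtained with the built-in Series.mean() method. The sketch below compares it with a manual loop on a small, made-up Series (the values are illustrative, not the real Rating column).

```python
import pandas as pd

# Hypothetical ratings, used only to illustrate the calls
ratings = pd.Series([4.1, 3.9, 4.7, 4.5, 4.3], name="Rating")

# Manual loop, as in the cell above
loop_total = 0.0
for r in ratings:
    loop_total += r
loop_mean = loop_total / len(ratings)

# Built-in equivalent; mean() skips NaN values by default
builtin_mean = ratings.mean()

print(round(loop_mean, 3), round(builtin_mean, 3))
```

One difference worth keeping in mind: if the column still contained NaN values, the manual loop would produce NaN, while mean() would ignore the missing entries.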


### How many Apps are there with Rating 5

In [6]:
c = 0

for i in df['Rating']:
    if (i == 5.0):
        c += 1

print(f"There are {c} applications with rating 5")

There are 274 applications with rating 5


### Apps with Rating between the range of 4 and 4.5

In [7]:
c = 0

for i in df['Rating']:
    if (i >= 4.0 and i <= 4.5):
        c += 1
print(f"There are {c} applications with rating between the range of 4 and 4.5")

There are 5446 applications with rating between the range of 4 and 4.5
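Both counts can also be written without an explicit loop by summing a Boolean mask. This is a sketch on made-up ratings; Series.between() includes both end points by default, which matches the `>= 4.0 and <= 4.5` condition used above.

```python
import pandas as pd

# Hypothetical ratings, used only to illustrate the calls
ratings = pd.Series([4.1, 3.9, 4.7, 4.5, 5.0, 4.0, 5.0], name="Rating")

five_star = (ratings == 5.0).sum()            # True counts as 1, so this counts matches
in_range = ratings.between(4.0, 4.5).sum()    # inclusive on both ends by default

print(f"{five_star} apps with rating 5, {in_range} apps between 4 and 4.5")
```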


# Data Analysis - Categorical

We have already done some analysis on numerical data. Let us work with Categorical data now. 

Before doing analysis let us see our dataset again.

In [8]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


### Total Number of Unique Categories

We can use the unique() function to get all the unique values of Category column

In [9]:
df['Category'].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

To get the total number of Unique values we can use the nunique() function.

In [11]:
n = df['Category'].nunique()

print("There are a total of", n, "unique values in Category columns")

There are a total of 33 unique values in Category columns


### Total Number of apps in ART_AND_DESIGN

In [12]:
c = 0

for i in df['Category']:
    if(i == 'ART_AND_DESIGN'):
        c += 1
print(f"There are a total of {c} applications in ART_AND_DESIGN")

There are a total of 61 applications in ART_AND_DESIGN


### Total number of Free and Paid Apps

In [14]:
f = 0
p = 0

for i in df['Type']:
    if(i == 'Free'):
        f += 1

for i in df['Type']:
    if(i == 'Paid'):
        p += 1

print(f"There are a total number of {f} free and {p} paid apps")

There are a total number of 8715 free and 645 paid apps


### Percentage of Free and Paid Apps

In [15]:
ff = f / len(df['Type']) * 100
ff = round(ff, 2)

pp = 100 - ff
pp = round(pp, 2)

print(f"A total of {ff}% of the apps are Free and {pp}% of the apps are paid")

A total of 93.11% of the apps are Free and 6.89% of the apps are paid
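The counts and percentages of each Type can also be computed with Series.value_counts(). The sketch below uses a made-up column; `normalize=True` returns fractions instead of counts, which are then scaled to percentages.

```python
import pandas as pd

# Hypothetical Type column, used only to illustrate the calls
types = pd.Series(["Free", "Free", "Paid", "Free"], name="Type")

counts = types.value_counts()                                  # rows per value
percent = (types.value_counts(normalize=True) * 100).round(2)  # share per value in %

print(counts)
print(percent)
```

Computing each percentage directly, rather than as 100 minus the other one, also stays correct if the column ever holds more than two distinct values.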


# Data Analysis - Automatic Categorical

Let us now see how we can automate the analysis of our categorical data, so that we no longer count one value at a time. First let us look at our dataset again, and then we can start with the analysis.

In [16]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


### Total number of apps in each category

In [17]:
categories = {}

for name in df['Category'].unique():
    ct = 0
    for i in df['Category']:
        if(i == name):
            ct += 1
    categories[name] = ct
    
print(categories)

{'ART_AND_DESIGN': 61, 'AUTO_AND_VEHICLES': 73, 'BEAUTY': 42, 'BOOKS_AND_REFERENCE': 178, 'BUSINESS': 303, 'COMICS': 58, 'COMMUNICATION': 328, 'DATING': 195, 'EDUCATION': 155, 'ENTERTAINMENT': 149, 'EVENTS': 45, 'FINANCE': 323, 'FOOD_AND_DRINK': 109, 'HEALTH_AND_FITNESS': 297, 'HOUSE_AND_HOME': 76, 'LIBRARIES_AND_DEMO': 64, 'LIFESTYLE': 314, 'GAME': 1097, 'FAMILY': 1746, 'MEDICAL': 350, 'SOCIAL': 259, 'SHOPPING': 238, 'PHOTOGRAPHY': 317, 'SPORTS': 319, 'TRAVEL_AND_LOCAL': 226, 'TOOLS': 733, 'PERSONALIZATION': 312, 'PRODUCTIVITY': 351, 'PARENTING': 50, 'WEATHER': 75, 'VIDEO_PLAYERS': 160, 'NEWS_AND_MAGAZINES': 233, 'MAPS_AND_NAVIGATION': 124}


With the help of this analysis, we now know how many apps each category contains in our dataset.

### Total number of apps in each Type

In [18]:
types = {}

for name in df['Type'].unique():
    ct = 0
    for i in df['Type']:
        if(i == name):
            ct += 1
    types[name] = ct
    
print(types)

{'Free': 8715, 'Paid': 645}


### Total number of apps in each Content Rating

In [19]:
content_rating = {}

for name in df['Content Rating'].unique():
    ct = 0
    for i in df['Content Rating']:
        if(i == name):
            ct += 1
    content_rating[name] = ct
    
print(content_rating)

{'Everyone': 7414, 'Teen': 1084, 'Everyone 10+': 397, 'Mature 17+': 461, 'Adults only 18+': 3, 'Unrated': 1}


So we can see from here the number of apps in each and every individual Content Rating
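The three cells above repeat the same nested loop for different columns. As a sketch, the pattern can be wrapped in one small function built on value_counts(). The helper name `count_values` and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def count_values(frame, column):
    """Return {value: number of rows} for one column (hypothetical helper)."""
    return frame[column].value_counts().to_dict()


# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "Category": ["GAME", "FAMILY", "GAME", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free"],
})

# One dictionary per column, keyed by column name
summary = {col: count_values(toy, col) for col in ["Category", "Type"]}
print(summary)
```

Unlike the nested loops, value_counts() passes over the column once, and it orders the result from the most to the least frequent value.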

# Null Values Handling - Numeric

Earlier we discussed an approach that removes null values from a dataset. But removing the null values is not always the optimal approach, because every dropped row also discards the valid values it contains. Let us see here how we can handle the null values instead of dropping them.

Let us first import pandas and read the CSV file.

In [22]:
import pandas as pd

df = pd.read_csv('Data.csv')

print(df)

   Country   Age   Salary Purchased
0   France  44.0  72000.0       Yes
1    Spain  27.0  48000.0       Yes
2      NaN  30.0  54000.0       NaN
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


So we can see that the dataset is small and that NaN (null) values are present in every column. How should we handle them?

**SimpleImputer** is a scikit-learn class that handles missing data before a dataset is used in a predictive model. It replaces NaN values with a value chosen by a given strategy.
It is created with the **SimpleImputer()** constructor, which takes the following arguments:

- **missing_values:** The placeholder that marks a missing value and is to be imputed. The default is NaN.
- **strategy:** How the replacement value is computed. It can be 'mean' (default), 'median', 'most_frequent' or 'constant'.
- **fill_value:** The value that replaces the missing entries when strategy is 'constant'.
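To see how the strategy argument changes the result, here is a minimal sketch that imputes the same one-column array with each strategy. The array `X` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing entry (row index 3)
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(X)
    results[strategy] = filled[3, 0]   # value that replaced the NaN

# The 'constant' strategy uses fill_value instead of a computed statistic
constant = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)
results["constant"] = constant[3, 0]

print(results)
```

The statistic is computed only from the non-missing entries of each column, here 1, 2, 2 and 7.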

In [23]:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')

df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3].values)

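Now if we print the DataFrame (df), we will notice that the null values in the numeric columns have been replaced. Because every value in Age and Salary occurs only once, the 'most_frequent' strategy returns the smallest value of each column (27.0 and 48000.0).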

In [24]:
print(df)

   Country   Age   Salary Purchased
0   France  44.0  72000.0       Yes
1    Spain  27.0  48000.0       Yes
2      NaN  30.0  54000.0       NaN
3    Spain  38.0  61000.0        No
4  Germany  40.0  48000.0       Yes
5   France  35.0  58000.0       Yes
6    Spain  27.0  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


To confirm it we can use the isnull().sum() function.

In [25]:
print(df.isnull().sum())

Country      1
Age          0
Salary       0
Purchased    1
dtype: int64


So the null values in the numeric columns have been filled in, while the nulls in Country and Purchased are still present.
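For numeric columns the 'mean' strategy is a common alternative to 'most_frequent'. The sketch below is an assumption-laden variant, not the cell used above: it works on a made-up frame and selects the numeric columns by name instead of by position.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})
numeric_cols = ["Age", "Salary"]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[numeric_cols] = imputer.fit_transform(toy[numeric_cols])

print(toy)
print(imputer.statistics_)   # the per-column means that were used as fill values
```

Selecting by column name keeps the code correct even if the column order of the file changes.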

# Null Values Handling - Categorical

We have already seen how we can handle numeric data with the help of scikit-learn. Categorical data is a little different, because the mean and median are not defined for non-numeric values. So for categorical data we fill the null values with the most frequent value (the mode).

Let us look again at the DataFrame from the previous section, in which the numeric columns are already filled.

In [26]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,Yes
1,Spain,27.0,48000.0,Yes
2,,30.0,54000.0,
3,Spain,38.0,61000.0,No
4,Germany,40.0,48000.0,Yes
5,France,35.0,58000.0,Yes
6,Spain,27.0,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


This time, instead of using iloc to select only the numeric columns, we select all the columns. We use 'most_frequent' as the strategy parameter because the mean and median cannot be computed for categorical data.

In [27]:
imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')

df.iloc[:, :] = imputer.fit_transform(df.iloc[:, :].values)

Now if we print the DataFrame (df), we will notice that the null values in the categorical columns as well as the numeric columns have been replaced.

In [28]:
print(df)

   Country   Age   Salary Purchased
0   France  44.0  72000.0       Yes
1    Spain  27.0  48000.0       Yes
2   France  30.0  54000.0       Yes
3    Spain  38.0  61000.0        No
4  Germany  40.0  48000.0       Yes
5   France  35.0  58000.0       Yes
6    Spain  27.0  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


To confirm it we can use the isnull().sum() function.

In [29]:
print(df.isnull().sum())

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64


So every null value has been replaced with the most frequent value of its column.
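When a frame mixes numeric and categorical columns, each group can also be given its own strategy. The following is a minimal sketch under stated assumptions: the frame and the column lists `categorical_cols` and `numeric_cols` are made up, and the missing category is written as np.nan so that it matches missing_values=np.nan.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "Country": ["France", "Spain", np.nan, "Spain"],
    "Age": [44.0, np.nan, 30.0, 38.0],
})
categorical_cols = ["Country"]
numeric_cols = ["Age"]

mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Categorical columns get the mode, numeric columns get the mean
toy[categorical_cols] = mode_imputer.fit_transform(
    toy[categorical_cols].to_numpy(dtype=object)
)
toy[numeric_cols] = mean_imputer.fit_transform(toy[numeric_cols].to_numpy())

print(toy)
```

Splitting the columns this way lets the numeric columns use a numeric statistic while the text columns still fall back to the mode.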

# Null Values Handling on GooglePlaystore Dataset

We have now seen how to handle null values for both numeric and categorical data. Let us use those techniques to handle the null values in our GooglePlaystore dataset.

So let us see our dataset again before getting started.

In [31]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.read_csv("googleplaystore.csv")

df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Let us use the isnull().sum() function to check the number of null data in every column.

In [32]:
print(df.isnull().sum())

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64


So a large number of values are null in the Rating column. We could drop the rows that contain them, but that would not be optimal because so many rows would be lost. Instead we are going to use SimpleImputer and replace the null values with the mean of the Rating column.

Let's get started with the coding implementation now.

In [33]:
impute = SimpleImputer(missing_values = np.nan , strategy = 'mean')

df.iloc[ : , 2:3 ] = impute.fit_transform(df.iloc[ : , 2:3 ].values)

So we have replaced the null values in the Rating column with the column mean, and no rows were lost. Let us use isnull().sum() again to check the remaining null values.

In [34]:
print(df.isnull().sum())

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    1
Genres            0
Last Updated      0
Current Ver       8
Android Ver       3
dtype: int64


A few null values are still present in the other columns. They are very few compared to the size of the dataset (less than 1%), so we can simply remove those rows.

In [35]:
df = df.dropna()

print(df.isnull().sum())

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64


So we have removed the remaining rows with null values, and no null values are left in our dataset.
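The "less than 1%" check can be made explicit before dropping anything. The sketch below computes the share of rows that contain at least one null and drops them only if that share is below a cut-off. The frame and the `threshold` value are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: 200 rows, one of which has a missing Type
toy = pd.DataFrame({
    "Rating": [4.0] * 200,
    "Type": ["Free"] * 199 + [None],
})

rows_with_null = toy.isnull().any(axis=1)   # True for rows with at least one null
share = rows_with_null.mean()               # fraction of affected rows

threshold = 0.01                            # hypothetical cut-off of 1%
cleaned = toy.dropna() if share < threshold else toy

print(f"{share:.2%} of rows contain a null; {len(cleaned)} rows kept")
```

Taking the mean of a Boolean Series gives the fraction of True values, so no separate division by the row count is needed.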