In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Loading data from CSV file into data frame.

In [3]:
appsdata = pd.read_csv('apps.csv')
appsdata.head()

Unnamed: 0.1,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [4]:
appsdata.shape

(9659, 14)

### Columns of dataset.

In [5]:
appsdata.columns

Index(['Unnamed: 0', 'App', 'Category', 'Rating', 'Reviews', 'Size',
       'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
       'Current Ver', 'Android Ver'],
      dtype='object')

In [6]:
appsdata.dtypes

Unnamed: 0          int64
App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

### Primary observations of dataset.

- The dataset has 9,659 rows and 14 columns.
- The dataset contains the following columns:
    - Unnamed: 0
    - App
    - Category
    - Rating
    - Reviews
    - Size
    - Installs
    - Type
    - Price
    - Content Rating
    - Genres
    - Last Updated
    - Current Ver
    - Android Ver
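- There are 4 numeric columns in the dataset: Unnamed: 0, Rating, Reviews, and Size.
- Some columns are stored as `object` even though they hold numeric or date values (for example, Installs and Last Updated), so there is scope to convert them to more appropriate types.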
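
The numeric/non-numeric split noted above can also be checked programmatically. The following is a minimal sketch on a small, made-up sample (the `sample` frame and its values are illustrative assumptions, not the real dataset); `select_dtypes` is the pandas call that does the work.

```python
import pandas as pd

# Hypothetical three-row sample mirroring a few columns of the dataset.
sample = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.1, 3.9, None],
    'Reviews': [159, 967, 87510],
    'Installs': ['10,000+', '500,000+', '5,000,000+'],
})

# Columns that pandas already stores with a numeric dtype.
numeric_cols = sample.select_dtypes(include='number').columns.tolist()

# Everything else is a candidate for type conversion.
other_cols = sample.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```

On the real data, the same two calls on `appsdata` would list the numeric columns and the columns still worth inspecting.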

### Null value analysis.

Let's look at the percentage of null values in each column.

In [7]:
(appsdata.isnull().sum() / len(appsdata)) * 100

Unnamed: 0         0.000000
App                0.000000
Category           0.000000
Rating            15.146495
Reviews            0.000000
Size              12.703178
Installs           0.000000
Type               0.000000
Price              0.000000
Content Rating     0.000000
Genres             0.000000
Last Updated       0.000000
Current Ver        0.082824
Android Ver        0.020706
dtype: float64

### Observation

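- The Rating and Size columns have about 15.1% and 12.7% null values, respectively.
- Current Ver and Android Ver have under 0.1% null values each; all other columns are complete.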
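
When only a few columns have missing values, it can help to see counts and percentages side by side and hide the complete columns. This is a minimal sketch on a made-up sample; the `sample` frame and the `null_summary` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in two columns.
sample = pd.DataFrame({
    'Rating': [4.1, np.nan, 4.7, np.nan],
    'Size': [19.0, 14.0, np.nan, 25.0],
    'Reviews': [159, 967, 87510, 215644],
})

# Count of nulls and share of nulls (in percent) per column.
null_summary = pd.DataFrame({
    'null_count': sample.isnull().sum(),
    'null_pct': (sample.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have nulls, worst first.
null_summary = (
    null_summary[null_summary['null_count'] > 0]
    .sort_values('null_pct', ascending=False)
)

print(null_summary)
```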

In [13]:
appsdata.value_counts('Installs')

Installs
1,000,000+        1417
100,000+          1112
10,000+           1031
10,000,000+        937
1,000+             888
100+               710
5,000,000+         607
500,000+           505
50,000+            469
5,000+             468
10+                385
500+               328
50+                204
50,000,000+        202
100,000,000+       188
5+                  82
1+                  67
500,000,000+        24
1,000,000,000+      20
0+                  14
0                    1
Name: count, dtype: int64

### Data clean-up

- The **Unnamed: 0** column carries no information relevant to the other columns, so it can be removed.
- The **Installs** column holds numeric data, but the values contain `,` separators and a trailing `+`. Removing these characters allows the column to be stored as integers.
- The **Last Updated** column stores dates as text (for example, `January 7, 2018`). It should be converted to a proper datetime type.

Let's remove the **Unnamed: 0** column from the dataset.


In [9]:
appsdata.drop('Unnamed: 0', axis=1, inplace=True)
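
A column named `Unnamed: 0` usually appears when a CSV was saved with its index and the header cell for that index is blank. Assuming that is the case here (the notebook does not confirm it), the column could instead be absorbed at load time with `index_col=0`. The sketch below uses a small in-memory CSV; `csv_text` and its contents are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first header cell is blank (a saved index).
csv_text = ",App,Rating\n0,Photo Editor,4.1\n1,Coloring book moana,3.9\n"

# Default load: the blank header becomes a column named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index so no extra column is created.
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())
print(as_index.columns.tolist())
```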

In [15]:
# Convert an install count such as '10,000+' into an integer.

def convertstrtonum(s):
    # Drop the thousands separators and the trailing '+', then parse.
    return int(s.replace(',', '').rstrip('+'))


In [16]:
# Apply the above function to correct data in Installs column.

appsdata['Installs'] = appsdata['Installs'].apply(convertstrtonum)

The data type of the **Installs** column is now **int64**.

In [19]:
appsdata['Installs'].dtype

dtype('int64')
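
As an alternative to applying a Python function row by row, the same clean-up can be expressed with pandas string methods. This is a minimal sketch on a made-up series (`installs` and `cleaned` are illustrative names), not the approach used in the notebook.

```python
import pandas as pd

# Hypothetical values in the same format as the Installs column.
installs = pd.Series(['10,000+', '500,000+', '0+', '0'])

# Strip every ',' and '+' in one vectorized pass, then cast to integer.
cleaned = installs.str.replace(r'[,+]', '', regex=True).astype('int64')

print(cleaned.tolist())
```

The vectorized form avoids a per-row function call, which tends to matter more as the number of rows grows.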

In [22]:
# Convert the Last Updated column to datetime.

appsdata['Last Updated'] = pd.to_datetime(appsdata['Last Updated'])
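
`pd.to_datetime` infers the date format when none is given. If every value follows the `January 7, 2018` pattern seen in the sample rows (an assumption about the full column), the format can be stated explicitly so that a value in any other layout raises an error instead of being guessed at. A minimal sketch on a made-up series, assuming English month names:

```python
import pandas as pd

# Hypothetical values in the same layout as the Last Updated column.
dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'June 20, 2018'])

# %B = full month name, %d = day of month, %Y = four-digit year.
parsed = pd.to_datetime(dates, format='%B %d, %Y')

print(parsed)
```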

In [31]:
appsdata.dtypes

App                       object
Category                  object
Rating                   float64
Reviews                    int64
Size                     float64
Installs                   int64
Type                      object
Price                     object
Content Rating            object
Genres                    object
Last Updated      datetime64[ns]
Current Ver               object
Android Ver               object
dtype: object

In [24]:
appsdata.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,10000,Free,0,Everyone,Art & Design,2018-01-07,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,0,Everyone,Art & Design,2018-08-01,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,50000000,Free,0,Teen,Art & Design,2018-06-08,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,100000,Free,0,Everyone,Art & Design;Creativity,2018-06-20,1.1,4.4 and up
