# Google Play Store App

In [1]:
%pip install ydata-profiling




### ***Import Libraries***

In [2]:
import pandas as pd
import ydata_profiling as yd

**ydata-profiling** is a Python library that profiles a dataset automatically and writes the results to a single report.

The report shows:

1. How many rows and columns your data has

2. Missing values and duplicates

3. Stats for each column (like mean, max, min)

4. Charts and graphs to understand the data

5. Warnings for problems in your data
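
To make the list above concrete, here is a rough sketch of how a few of the same overview numbers can be computed by hand with plain pandas. The `toy` DataFrame and its values are made up for illustration; it is not the Play Store data, and this is not how ydata-profiling works internally.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "app": ["a", "b", "b", "c"],
    "rating": [4.0, 3.0, 3.0, None],
})

n_rows, n_cols = toy.shape                 # 1. number of rows and columns
n_missing = int(toy.isnull().sum().sum())  # 2. total missing cells
n_dupes = int(toy.duplicated().sum())      # 2. fully duplicated rows
stats = toy["rating"].describe()           # 3. per-column statistics

print(n_rows, n_cols, n_missing, n_dupes)
print(stats[["mean", "min", "max"]])
```

The profiling report collects these figures for every column at once, which is the main convenience over calling each method separately.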

### ***Mount Drive***

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import os
os.chdir("/content/drive/MyDrive/Colab Notebooks/Machine Learning")

### ***Import Dataset***

In [5]:
df = pd.read_csv("Google-Playstore.csv")
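
For a file of this size, one option is to load only the columns that are needed and to give repeated text labels a compact dtype. The sketch below uses a two-row in-memory CSV as a stand-in for the real file; the choice of columns is an assumption made for illustration.

```python
import io
import pandas as pd

# Two rows with a few of the dataset's columns, standing in for the real file
csv_text = io.StringIO(
    "App Name,Category,Rating,Free\n"
    "Gakondo,Adventure,0.0,True\n"
    "Ampere Battery Info,Tools,4.4,True\n"
)

subset = pd.read_csv(
    csv_text,
    usecols=["App Name", "Category", "Rating"],  # skip columns that are not needed
    dtype={"Category": "category"},              # store repeated labels compactly
)
print(subset.dtypes)
```

With the real file, the same `usecols` and `dtype` arguments would be passed to the `pd.read_csv("Google-Playstore.csv")` call.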

### ***Run automatic EDA using the ydata-profiling library***

In [6]:
report = yd.ProfileReport(
    df,
    explorative=True,
    minimal=True  # only basic stats, faster and uses less RAM
)
report.to_file("playstore_report.html")


100%|██████████| 24/24 [02:07<00:00,  5.31s/it]



In [7]:
from google.colab import files
files.download("playstore_report.html")



In [9]:
df.head()

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,Gakondo,com.ishakwe.gakondo,Adventure,0.0,0.0,10+,10.0,15,True,0.0,...,https://beniyizibyose.tk/#/,jean21101999@gmail.com,"Feb 26, 2020","Feb 26, 2020",Everyone,https://beniyizibyose.tk/projects/,False,False,False,2021-06-15 20:19:35
1,Ampere Battery Info,com.webserveis.batteryinfo,Tools,4.4,64.0,"5,000+",5000.0,7662,True,0.0,...,https://webserveis.netlify.app/,webserveis@gmail.com,"May 21, 2020","May 06, 2021",Everyone,https://dev4phones.wordpress.com/licencia-de-uso/,True,False,False,2021-06-15 20:19:35
2,Vibook,com.doantiepvien.crm,Productivity,0.0,0.0,50+,50.0,58,True,0.0,...,,vnacrewit@gmail.com,"Aug 9, 2019","Aug 19, 2019",Everyone,https://www.vietnamairlines.com/vn/en/terms-an...,False,False,False,2021-06-15 20:19:35
3,Smart City Trichy Public Service Vehicles 17UC...,cst.stJoseph.ug17ucs548,Communication,5.0,5.0,10+,10.0,19,True,0.0,...,http://www.climatesmarttech.com/,climatesmarttech2@gmail.com,"Sep 10, 2018","Oct 13, 2018",Everyone,,True,False,False,2021-06-15 20:19:35
4,GROW.me,com.horodyski.grower,Tools,0.0,0.0,100+,100.0,478,True,0.0,...,http://www.horodyski.com.pl,rmilekhorodyski@gmail.com,"Feb 21, 2020","Nov 12, 2018",Everyone,http://www.horodyski.com.pl,False,False,False,2021-06-15 20:19:35




In [14]:
df.info()  # number of variables (columns) and the dtype of each

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312944 entries, 0 to 2312943
Data columns (total 24 columns):
 #   Column             Dtype  
---  ------             -----  
 0   App Name           object 
 1   App Id             object 
 2   Category           object 
 3   Rating             float64
 4   Rating Count       float64
 5   Installs           object 
 6   Minimum Installs   float64
 7   Maximum Installs   int64  
 8   Free               bool   
 9   Price              float64
 10  Currency           object 
 11  Size               object 
 12  Minimum Android    object 
 13  Developer Id       object 
 14  Developer Website  object 
 15  Developer Email    object 
 16  Released           object 
 17  Last Updated       object 
 18  Content Rating     object 
 19  Privacy Policy     object 
 20  Ad Supported       bool   
 21  In App Purchases   bool   
 22  Editors Choice     bool   
 23  Scraped Time       object 
dtypes: bool(4), float64(4), int64(1), object(15)
memor
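
The listing above shows `Installs`, `Released`, and `Last Updated` stored as `object` (strings). A minimal sketch of converting such columns to numeric and datetime types follows. The sample values are copied from the `df.head()` output; the cleaning rules (strip `,` and `+`, parse `"%b %d, %Y"`) are assumptions based on those few rows and may not cover every value in the full dataset.

```python
import pandas as pd

# Sample values taken from the first rows shown earlier
installs = pd.Series(["10+", "5,000+", "50+"])
released = pd.Series(["Feb 26, 2020", "May 21, 2020", "Aug 9, 2019"])

# Remove thousands separators and the trailing "+", then cast to integer
installs_num = installs.str.replace(r"[,+]", "", regex=True).astype("int64")

# Parse dates written as abbreviated month, day, year
released_dt = pd.to_datetime(released, format="%b %d, %Y")

print(installs_num.tolist())
print(released_dt.dt.year.tolist())
```

On the full columns, missing values would need handling first, since `Installs` and `Released` both contain nulls.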

### ***Take a sample from the large dataset***

In [34]:
df.sample(10)

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
1671887,Bowling Green Area CVB,com.visitapps.bowlinggreen,Travel & Local,0.0,0.0,"1,000+",1000.0,1202,True,0.0,...,,visitapps@simpleviewinc.com,"Feb 14, 2017","Dec 10, 2019",Everyone,https://www.visitbgky.com/privacy-policy/,False,False,False,2021-06-16 03:47:42
1905264,LQSA Quiz Juego de laqueseavecina Trivial,com.juegosdelqsa.quiztrivial,Trivia,3.4,33.0,"5,000+",5000.0,5938,True,0.0,...,http://www.megaquiztrivia2.blogspot.com,megaquiztrivia@gmail.com,"Nov 26, 2019","Nov 26, 2019",Everyone,https://lqsaquiztrivia.blogspot.com/2019/11/pr...,True,True,False,2021-06-16 07:02:59
1279249,SHARING CAM,com.camerite.vigivel.app,Productivity,3.7,7.0,"1,000+",1000.0,1024,True,0.0,...,http://vigivel.com.br,diradm@vigivel.com.br,"Nov 29, 2016","Feb 18, 2021",Everyone,http://cameras.vigivel.com.br/about/privacy,False,False,False,2021-06-15 21:52:29
1635018,412 Food Rescue,com.fouronetwo.foodrescue,Food & Drink,3.9,40.0,"5,000+",5000.0,6173,True,0.0,...,http://412foodrescue.org/,info@412foodrescue.org,"Nov 9, 2016","May 20, 2021",Everyone,https://public.foodrescuehero.org/412_food_res...,False,False,False,2021-06-16 03:16:53
2156985,Qatar Podcast,com.qatar.podcast,Music & Audio,0.0,0.0,1+,1.0,2,True,0.0,...,,vivienechinap@gmail.com,"Aug 24, 2020","Aug 24, 2020",Everyone,,True,False,False,2021-06-16 10:43:02
2305084,airbubbl,com.r4s.airbubbl,Auto & Vehicles,5.0,6.0,100+,100.0,357,True,0.0,...,http://airbubbl.com,developer@airbubbl.com,"Nov 2, 2018","Nov 28, 2019",Everyone,https://airbubbl.com/privacy-policy,False,False,False,2021-06-16 12:52:24
406061,HUMIDITY CALCULATOR LAB-EL,pl.label.humidity_calculator,Tools,4.3,7.0,"5,000+",5000.0,6387,True,0.0,...,,android@label.pl,"Jan 26, 2016","Jan 26, 2016",Everyone,,False,False,False,2021-06-16 03:10:43
1755047,Alphabet Soup with 20 different language,com.lu.alphabet_hidato,Puzzle,3.5,6.0,"1,000+",1000.0,1918,True,0.0,...,http://www-personal.umich.edu/~zhlu/,zhlu@umich.edu,"Mar 3, 2012","Mar 16, 2019",Everyone,https://sites.google.com/view/lu-com-policypage,True,False,False,2021-06-16 04:56:30
916922,Green Education Teacher,apph.greeneducationteacher,Tools,0.0,0.0,10+,10.0,19,True,0.0,...,,h110281@gmail.com,"Jul 4, 2017","Aug 05, 2018",Everyone,,False,False,False,2021-06-16 11:01:20
1241108,Inmobiliaria Aguilar,com.framelova.appaguilardemo,Tools,4.9,30.0,500+,500.0,802,True,0.0,...,https://casasaguilar.com.mx,aguilar@framelova.com,,"May 23, 2019",Everyone,https://www.casasaguilar.com.mx/aviso-de-priva...,False,False,False,2021-06-15 21:18:29
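
`df.sample(10)` returns different rows on every run. If a repeatable sample is wanted, `random_state` can be fixed, as in this small sketch on a made-up frame (the seed values are arbitrary):

```python
import pandas as pd

# Hypothetical frame with 100 rows, for illustration only
toy = pd.DataFrame({"x": range(100)})

# The same seed gives the same rows each time
s1 = toy.sample(n=10, random_state=42)
s2 = toy.sample(n=10, random_state=42)
print(s1.equals(s2))

# A fraction of the rows can be requested instead of a fixed count
frac_sample = toy.sample(frac=0.1, random_state=0)
print(len(frac_sample))
```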


In [35]:
df.describe()

Unnamed: 0,Rating,Rating Count,Minimum Installs,Maximum Installs,Price
count,2290061.0,2290061.0,2312837.0,2312944.0,2312944.0
mean,2.203152,2864.839,183445.2,320201.7,0.1034992
std,2.106223,212162.6,15131440.0,23554950.0,2.633127
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,50.0,84.0,0.0
50%,2.9,6.0,500.0,695.0,0.0
75%,4.3,42.0,5000.0,7354.0,0.0
max,5.0,138557600.0,10000000000.0,12057630000.0,400.0
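
The 25% row above is 0.0 for both `Rating` and `Rating Count`, so at least a quarter of the apps have no ratings at all. That suggests a `Rating` of 0.0 may mean "not yet rated" rather than a real score, which would pull the overall mean down. This reading is an assumption, not something the dataset states. The sketch below compares the two means on the five rows from `df.head()`:

```python
import pandas as pd

# Rating and Rating Count of the first five rows shown earlier
toy = pd.DataFrame({
    "Rating": [0.0, 4.4, 0.0, 5.0, 0.0],
    "Rating Count": [0.0, 64.0, 0.0, 5.0, 0.0],
})

# Keep only apps that have received at least one rating
rated = toy[toy["Rating Count"] > 0]

mean_all = toy["Rating"].mean()
mean_rated = rated["Rating"].mean()
print(mean_all, mean_rated)
```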


In [36]:
df['Price'].unique()

array([ 0.      ,  1.99    ,  4.99    , ...,  3.041816, 26.746362,
       18.903596])

In [37]:
df['Price'].value_counts()

Price,count
0.000000,2268011
0.990000,11851
1.990000,5817
2.990000,3921
1.490000,3823
...,...
3.550000,1
16.880000,1
4.919821,1
3.395531,1


In [38]:
df['Price'].isnull().sum()

np.int64(0)
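
The counts above already say how much of the dataset is free. A short worked calculation, using only the two numbers reported in this notebook (2268011 rows with `Price` equal to 0 and 2312944 rows in total):

```python
free_count = 2_268_011  # rows with Price == 0.0, from value_counts()
total_rows = 2_312_944  # total rows, from df.info()

free_share = free_count / total_rows
paid_rows = total_rows - free_count

print(f"{free_share:.2%} of apps are free; {paid_rows} have a non-zero price")
```

On the DataFrame itself, `df['Price'].value_counts(normalize=True)` gives the same shares directly.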

In [15]:
len(df) #no of observations

2312944

In [8]:
df.columns

Index(['App Name', 'App Id', 'Category', 'Rating', 'Rating Count', 'Installs',
       'Minimum Installs', 'Maximum Installs', 'Free', 'Price', 'Currency',
       'Size', 'Minimum Android', 'Developer Id', 'Developer Website',
       'Developer Email', 'Released', 'Last Updated', 'Content Rating',
       'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice',
       'Scraped Time'],
      dtype='object')

In [10]:
df.isnull()

Unnamed: 0,App Name,App Id,Category,Rating,Rating Count,Installs,Minimum Installs,Maximum Installs,Free,Price,...,Developer Website,Developer Email,Released,Last Updated,Content Rating,Privacy Policy,Ad Supported,In App Purchases,Editors Choice,Scraped Time
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2312939,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
2312940,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2312941,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
2312942,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### ***Missing cells***

In [18]:
df.isnull().sum().sum()

np.int64(1305751)

In [17]:
df.isnull().sum() / len(df) * 100

Column,Missing (%)
App Name,0.000216
App Id,0.0
Category,0.0
Rating,0.989345
Rating Count,0.989345
Installs,0.004626
Minimum Installs,0.004626
Maximum Installs,0.0
Free,0.0
Price,0.0
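
With 24 columns, the missing-value percentages are easier to read when sorted and limited to columns that actually have gaps. A minimal sketch on a made-up frame (the column names `a`, `b`, `c` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "b" is half missing, "a" a quarter, "c" complete
toy = pd.DataFrame({
    "a": [1, np.nan, 3, 4],
    "b": [np.nan, np.nan, 1, 2],
    "c": [1, 2, 3, 4],
})

# isnull().mean() is the fraction of missing cells per column
missing_pct = toy.isnull().mean().mul(100)

# Drop complete columns and list the worst ones first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```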


In [12]:
df.dtypes

Column,dtype
App Name,object
App Id,object
Category,object
Rating,float64
Rating Count,float64
Installs,object
Minimum Installs,float64
Maximum Installs,int64
Free,bool
Price,float64


### ***Duplicate rows***

In [19]:
df.duplicated()

Row,Duplicated
0,False
1,False
2,False
3,False
4,False
...,...
2312939,False
2312940,False
2312941,False
2312942,False


In [20]:
print(df.duplicated())

0          False
1          False
2          False
3          False
4          False
           ...  
2312939    False
2312940    False
2312941    False
2312942    False
2312943    False
Length: 2312944, dtype: bool


In [21]:
print(df.duplicated().sum())

0
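
`df.duplicated()` only flags rows that match on every column. Rows can still repeat on a key such as `App Id` while differing elsewhere, and the `subset` argument checks for that. The frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical rows: "com.a" appears twice with different ratings
toy = pd.DataFrame({
    "App Id": ["com.a", "com.b", "com.a"],
    "Rating": [4.0, 3.5, 4.2],
})

full_dupes = int(toy.duplicated().sum())                  # identical in all columns
key_dupes = int(toy.duplicated(subset=["App Id"]).sum())  # repeated key only

# Keep the first row for each App Id
deduped = toy.drop_duplicates(subset=["App Id"], keep="first")
print(full_dupes, key_dupes, len(deduped))
```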


### ***Memory usage***

In [22]:
df.memory_usage(deep=True).sum()

np.int64(2325629211)

In [25]:
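# Shallow per-column sizes. Note that memory_usage('deep') would bind 'deep' to the
# `index` parameter, so deep=True must be passed by keyword to include string contents.
df.memory_usage()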

Column,Bytes
Index,132
App Name,18503552
App Id,18503552
Category,18503552
Rating,18503552
Rating Count,18503552
Installs,18503552
Minimum Installs,18503552
Maximum Installs,18503552
Free,2312944
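
The object columns above all report 18503552 bytes, which is exactly 2312944 rows × 8 bytes. That is the size of the pointers only, which is what a shallow measurement counts. The sketch below shows the difference between shallow and deep measurement on three app names from this dataset, stored with `dtype=object` to match the column dtype shown earlier:

```python
import pandas as pd

# Three app names from the dataset, stored as Python objects
names = pd.Series(["Gakondo", "Ampere Battery Info", "Vibook"], dtype=object)

# Shallow: counts only the array of references to the strings
shallow = names.memory_usage(index=False, deep=False)

# Deep: also counts the memory held by each string object
deep = names.memory_usage(index=False, deep=True)

print(shallow, deep)
```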


### ***Memory size in GiB***

In [24]:
df.memory_usage(deep=True).sum() / (1024**3)

np.float64(2.165910984389484)
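
Much of that footprint comes from string columns. One common way to shrink it, sketched here under the assumption that a column such as `Category` has few distinct values, is to convert it to the `category` dtype. The frame is synthetic, with category names taken from rows shown earlier:

```python
import pandas as pd

# Hypothetical column: 10,000 rows drawn from only three distinct labels
toy = pd.DataFrame({"Category": ["Tools", "Puzzle", "Trivia", "Tools"] * 2500})

before = toy.memory_usage(deep=True).sum()

# Store each distinct label once and keep small integer codes per row
toy["Category"] = toy["Category"].astype("category")

after = toy.memory_usage(deep=True).sum()
print(before, after)
```

The saving depends on how many distinct values a column has, so near-unique columns such as `App Id` would gain nothing from this conversion.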

In [26]:
print(df['App Name'])

0                                                    Gakondo
1                                        Ampere Battery Info
2                                                     Vibook
3          Smart City Trichy Public Service Vehicles 17UC...
4                                                    GROW.me
                                 ...                        
2312939                                             大俠客—熱血歸來
2312940                                           ORU Online
2312941                                       Data Structure
2312942                                          Devi Suktam
2312943                         Biliyor Musun - Sonsuz Yarış
Name: App Name, Length: 2312944, dtype: object


In [29]:
df['App Name']

Unnamed: 0,App Name
0,Gakondo
1,Ampere Battery Info
2,Vibook
3,Smart City Trichy Public Service Vehicles 17UC...
4,GROW.me
...,...
2312939,大俠客—熱血歸來
2312940,ORU Online
2312941,Data Structure
2312942,Devi Suktam


In [32]:
df['App Id'].unique()

array(['com.ishakwe.gakondo', 'com.webserveis.batteryinfo',
       'com.doantiepvien.crm', ...,
       'datastructure.appoworld.datastucture', 'ishan.devi.suktam',
       'com.yyazilim.biliyormusun'], dtype=object)

In [33]:
df['App Id'].nunique()

2312944