---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.11(Pandas-02)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Overview of Pandas Series Data Structure.ipynb_

#### Read about Pandas Data: https://pandas.pydata.org/docs/user_guide

### Recap:
**In the last lecture, we introduced the pandas library, looked at the anatomy of a pandas dataframe, and saw that every dataframe is composed of Series objects.**

## Learning agenda of this notebook

1. Overview of Python Pandas library and its data structures
2. Creating a Series
    - From Python List
    - From NumPy Arrays
    - From Python Dictionary
    - From a scalar value
    - Creating empty series object
3. Attributes of a Pandas Series
4. Understanding Index in a Series and its usage
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment

In [None]:
# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet

In [1]:
import pandas as pd
pd.__version__ , pd.__path__

('1.4.2', ['/home/dell/.local/lib/python3.8/site-packages/pandas'])

<img align="right" width="500" height="600"  src="images/series-anatomy.png"  >

## 2. Creating a Series
> **A Series is a one-dimensional array capable of holding a sequence of values of any data type (integers, floating point numbers, strings, Python objects etc) which by default have numeric data labels starting from zero. You can imagine a Pandas Series as a column in a spreadsheet or a Pandas Dataframe object.**
- To create a Series object you can use the `pd.Series()` constructor

**```pd.Series(data, index, dtype, name)```**
- Where,
   - `data`: can be a Python list, Python dictionary, NumPy array, or a scalar value.
   - `index`: If you do not pass the index argument, it defaults to a `RangeIndex` of 0, 1, ..., n-1 (the same labels as `np.arange(n)`). Index labels must be hashable (for example numbers or strings) and, when `data` is a list or array, the index must have the same length as `data`. Non-unique index values are allowed. Index is used for three purposes:
       - Identification.
       - Selection.
       - Alignment.
   - `dtype`: Optionally, you can assign any valid NumPy data type (for example `np.uint8` or `'float64'`) to the series object. If not specified, it is inferred from `data`.
   - `name`: Optionally, you can assign a name to a series, which becomes an attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.

### a. Creating a Series from Python List

In [None]:
import pandas as pd
import numpy as np
list1 = ['Ehtisham', 'Ali', 'Ayesha', '','Dua']  # note the empty string

# When index is not provided, it creates an index for the data starting from zero and with a step size of one.
s = pd.Series(data=list1)
print(s)
print(type(s))

>Observe that output is shown in two columns - the index is on the left and the data value is on the right. If we do not explicitly specify an index for the data values while creating a series, then by default indices range from 0 through N – 1. Here N is the number of data elements.

**You can explicitly specify the index for a Series object. The labels can be integers, floats, or strings, and for list or array data the index must have the same length as the data; otherwise a `ValueError` is raised.**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
indices = ['MS01', 'MS02', '', 'MS02']   # non-unique index values are allowed and you can have empty string as index

s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))

>Also note that non-unique indices are allowed
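A minimal sketch of the two points above, reusing the data from the previous cell: looking up a duplicated label returns every matching value, and an index of the wrong length raises a `ValueError`. The variable names `dup`, `single` and `length_error` are only illustrative.

```python
import pandas as pd

names = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
s = pd.Series(data=names, index=['MS01', 'MS02', '', 'MS02'])

dup = s.loc['MS02']      # duplicated label: a Series holding both matches
single = s.loc['MS01']   # unique label: a single value
print(dup)
print(single)

# An index with fewer labels than data values is rejected
try:
    pd.Series(data=names, index=['MS01', 'MS02'])
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```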

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
indices = [2.1, 2.2, 2.3, 2.4]   

s = pd.Series(data=list1, index=indices)
print(s)
print(type(s))

**You can create a series with NaN values using `np.nan`, which is the IEEE 754 floating-point representation of Not a Number. NaN acts as a placeholder for missing numerical values in the array.**

In [None]:
list1 = [1, 2.7, np.nan, 54]
s = pd.Series(data=list1)
print(s)
print(type(s))

>Also note the `dtype` of the series object is inferred from the data as `float64`
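To see why the dtype is inferred as `float64`, the following sketch compares a list of integers with and without a missing value. The names `ints_only` and `with_nan` are illustrative only.

```python
import numpy as np
import pandas as pd

ints_only = pd.Series([1, 2, 54])          # no missing values: an integer dtype is inferred
with_nan = pd.Series([1, 2, np.nan, 54])   # np.nan is a float, so the whole series becomes float64

print(ints_only.dtype, with_nan.dtype)
print(with_nan.isna())                     # True marks the missing position
```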

**You can use the `dtype` argument to specify a datatype to the series object.**

In [None]:
list1 = [27, 33, 19]
s = pd.Series(data=list1, dtype=np.uint8)
print(s)
print(type(s))

**Optionally, you can assign a name to a series, which becomes an attribute of the series object. It also becomes the column name if that series object is later used to create a dataframe.**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
indices = ['MS01', 'MS02', 'MS03', 'MS04']
s = pd.Series(data=list1, index=indices, name='myseries1') 
print(s)
print(type(s))

### b. Creating a Series from NumPy Array

In [None]:
np.arange(4)

In [None]:
s = pd.Series(data = np.arange(4))
print(s)
print(type(s))

In [None]:
arr1 = np.array([22.3,33.6, 98, 44])
s = pd.Series(data=arr1, dtype='float64')
print(s)
print(type(s))

### c. Creating a Series from Python Dictionary

In [None]:
my_dict = {
    'name':"Ehtisham", 
    'gender':"Male", 
    'Role':"Student", 
    'subject':"Operating System"}
s = pd.Series(data=my_dict)
print(s)
print(type(s))

**When you create a series from dictionary, it will automatically take the keys as index and the value as data**
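As a small sketch of how the keys act as labels: if you also pass `index` together with a dictionary, only the listed keys are kept, in the given order, and a label that is not a key gets NaN. The label `'city'` is a made-up example.

```python
import pandas as pd

my_dict = {'name': 'Ehtisham', 'gender': 'Male', 'Role': 'Student'}

# 'Role' and 'name' are looked up in the dictionary; 'city' is not a key, so it becomes NaN
s = pd.Series(data=my_dict, index=['Role', 'name', 'city'])
print(s)
```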

### d. Creating a Series from Scalar value

In [None]:
s = pd.Series(data=25)
print(s)
print(type(s))
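
A scalar on its own gives a series with one element. As a short sketch, passing an `index` as well repeats the scalar once per label (the labels here are arbitrary examples):

```python
import pandas as pd

one_item = pd.Series(data=25)                          # a single element with label 0
repeated = pd.Series(data=25, index=['a', 'b', 'c'])   # 25 is repeated for each label
print(one_item)
print(repeated)
```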

### e. Creating an Empty Series

In [None]:
# Pass at least `dtype`; without it, pandas 1.x emits a warning about the default dtype of an empty Series
s = pd.Series(dtype='float64')
print(s)
print(type(s))

## 3. Attributes of Pandas Series
- We can access certain properties of a series, called attributes, by writing the attribute name after the series name using dot `.` notation.
- Most attributes of a pandas series are similar to those of a pandas dataframe.

In [None]:
my_dict = {0:"Ehtisham", 1:np.nan, 2:"Ali", 3:"Ayesha", 4:"Dua", 5:"Khubaib", 6:"Adeen"}
s = pd.Series(my_dict, name="myseries1")
s

In [None]:
# `name` attribute of a series object returns the name of the series object
s.name

In [None]:
# `index` attribute of a series object returns its Index object (the labels and their datatype)
s.index

In [None]:
# `values` attribute of a series object returns the data as a NumPy array
s.values

In [None]:
# `dtype` attribute of a series object returns the type of the underlying data
s.dtype

In [None]:
# `shape` attribute of a series object returns a tuple with the shape of the underlying data
s.shape

In [None]:
# `nbytes` attribute of a series object returns the number of bytes of the underlying data
# (for object dtype, each element is an 8-byte pointer on a 64-bit system)
s.nbytes

In [None]:
# `size` attribute of a series object returns the number of elements in the underlying data
s.size

In [None]:
# `ndim` attribute of a series object returns the number of dimensions of the underlying data
s.ndim

In [None]:
# `hasnans` attribute of a series object returns True if there are NaN values in the data
s.hasnans

<img align="right" width="500" height="500"  src="images/series-anatomy.png"  >

## 4. Understanding Index in a Series
- Every item in a series object has an associated index label.
- The Pandas series object supports both integer-based (default) and label/string-based indexing and provides a host of methods for performing operations involving the index.
<br><br>
    - When the index is unique, Pandas uses a hashtable to map `key to value`, and searching can be done in O(1) time.
    - When the index is non-unique but sorted, Pandas uses binary search, which takes logarithmic time O(logN).
    - When the index is randomly ordered, searching takes linear time O(N), as Pandas needs to check all the keys in the index.<br><br>
- Index in series object is used for three purposes:
    - Identification
    - Selection/Filtering/Subsetting
    - Alignment <br><br>
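
The three lookup cases above depend on whether the index is unique and whether it is sorted. The sketch below does not measure timing; it only uses the `Index.is_unique` and `Index.is_monotonic_increasing` properties to check which case a given series falls into. The three example series are made up for illustration.

```python
import pandas as pd

unique_idx = pd.Series([1, 2, 3], index=['a', 'b', 'c'])       # unique labels
sorted_dupes = pd.Series([1, 2, 3], index=['a', 'a', 'b'])     # duplicates, but sorted
unsorted_dupes = pd.Series([1, 2, 3], index=['b', 'a', 'b'])   # duplicates in random order

for label, ser in [('unique', unique_idx),
                   ('sorted duplicates', sorted_dupes),
                   ('unsorted duplicates', unsorted_dupes)]:
    print(label, ser.index.is_unique, ser.index.is_monotonic_increasing)
```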

### a. Changing Index of a Series Object
- In the above examples, we have seen that
    - If we create a Series object from a dictionary, the keys of the dictionary become the index
    - If we create a Series object from a list or NumPy array, the index defaults to the integers 0, 1, 2, ...
    - Last but not least, we can assign indices of our own choice, which can be integers or strings
- Let us see how we can change the indices of a series object after creation

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

s = pd.Series(data=list1)
print(s)
print(s.index)

>The `index` attribute of the series object shows that the index of this series ranges from 0 to 4 with a step of 1

**Let us modify the index of this series object to some random integers by assigning a random array of integers to `index` attribute of this series object**

In [None]:
arr1 = np.random.randint(low = 100, high = 200, size = 5)

arr1

In [None]:
s.index = arr1

print(s)
print(s.index)

In [None]:
s.index = [1,4,2,6.3,9]

print(s)
print(s.index)

**Changing index of a series to a list of strings**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

s = pd.Series(data=list1)
print(s)
print(s.index)

In [None]:
indices = ['num1', 'num2', 'num3', 'num4', 'num5']

s.index = indices

print(s)
print(s.index)

<img align="right" width="300" height="300"  src="images/series-anatomy.png"  >

### b. First use of Index (Identification)
- The main purpose of the index is to identify or look up the elements of a series object.
- Every data value of a series object has an associated index label (integer or string), so we can use this label to identify or access data value(s).
- There are three ways to access elements of a series:
    - Using the `s[]` operator and specifying the index label (integer/string)
    - Using the `s.loc[]` indexer and specifying the index label (integer/string)
    - Using the `s.iloc[]` indexer and specifying the position (an integer from 0 to length-1). It also supports negative indexing: the last element can be accessed with position -1

**Identification using Integer Indices or by Position**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s

In [None]:
# Give index to subscript operator
s[25]

# With an integer index, the subscript operator looks up labels, not positions
# s[0] # will raise a KeyError because label 0 does not exist

In [None]:
# Give index to loc
s.loc[20]
# loc does not work on position
# s.loc[0] # will raise a KeyError because label 0 does not exist

In [None]:
# iloc is position based, so it raises an IndexError if you pass a label that is not a valid position
# s.iloc[20] 

In [None]:
# The iloc method is passed position and not index
s.iloc[3]


**Fancy Indexing**

In [None]:
# Can access multiple values by specifying a list of indices
s[[20, 5]]

In [None]:
# Can access multiple values by specifying a list of indices
s.loc[[20, 5]]

In [None]:
# Can access multiple values by specifying list of positions
s.iloc[[3, 0]]

**Negative indexing (here, with integer labels) works only with `iloc`**

In [None]:
s

In [None]:
# s[-1]
# s.loc[-1]
s.iloc[-1]

**Identification using String Indices or by Position**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s

In [None]:
# Give index to subscript operator (which in this case is a string or label)
s['num1']

In [None]:
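# With a non-integer index, `[]` also accepts a position in pandas 1.x
# (this fallback is deprecated in newer pandas versions; prefer s.iloc[2])
s[2]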

In [None]:
# Give index to  loc method (which in this case is a string or label)
s.loc['num1']

In [None]:
# Will not work on position the way [] worked previously
#s.loc[0]

In [None]:
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num1'] 
# however will work fine if you pass an integer specifying the position
s.iloc[0]

In [None]:
s.iloc[-1]

**Fancy Indexing**

In [None]:
# Can access multiple values by specifying a list of indices (which in this case are strings or labels)
s[['num3', 'num1']]

In [None]:
# Can access multiple values by specifying a list of indices (which in this case are strings or labels)
s.loc[['num3', 'num1']]

In [None]:
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num3', 'num1'] 
# however will work fine if you pass an integer specifying the position
s.iloc[[2,0]]

<img align="right" width="400" height="400"  src="images/series-anatomy.png"  >

### c. Second use of Index (Selection)
- A series can be sliced using the `:` symbol, which returns a subset of a series object (values with their corresponding indices).
- A slice takes three arguments, `[start:stop:step]`, and all are optional.

- The slice object can be used in three ways to slice a Pandas Series object:
    - Using the `s[]` operator and specifying index labels or integer positions
    - Using the `s.loc[]` indexer and specifying index labels (integer/string)
    - Using the `s.iloc[]` indexer and specifying positions (integers from 0 to length-1). It also supports negative indexing: the last element is at position -1
- Keep the following points in mind:
    - With `s[]`, an integer slice is treated as positions, so `stop` is NOT inclusive; a slice of string labels is treated as labels, so `stop` is inclusive.
    - The `stop` argument is inclusive for `s.loc[]` for both integer and string labels.
    - The `stop` argument is NOT inclusive for `s.iloc[]`, being position based.
  
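>**Note: In the pandas version used here (1.4.2), slicing a series typically returns a view of the original object rather than an independent copy, so modifying an element through one object can also be visible in the other. Newer pandas versions with Copy-on-Write enabled behave differently; call `.copy()` when you need an independent series.**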
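
Whether a slice shares data with the original depends on the pandas version and its settings, so the sketch below only shows the version-independent way to get an independent subset: an explicit `.copy()`. The name `subset` is illustrative.

```python
import pandas as pd

s = pd.Series(['Ehtisham', 'Ali', 'Ayesha', 'Dua', 'Adeen'], index=[5, 10, 15, 20, 25])

subset = s.iloc[1:4].copy()   # explicit copy: independent of `s`
subset.iloc[0] = 'CHANGED'    # modifies the copy only

print(s.iloc[1])              # the original value is untouched
print(subset.iloc[0])
```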

**Selection/Filtering/Subsetting of Series object having Integer indices**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

indices = [5, 10, 15, 20, 25]
s = pd.Series(data=list1, index=indices)
s

In [None]:
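# Integer slices in `[]` are positional: positions 5 to 14 do not exist here, so this returns an empty Series
s[5:15]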

In [None]:
# The subscript operator considers the slice object as positional index and not as the actual indices 
# (if we have integer indices)
# The `stop` argument is NOT inclusive for `s[]` for integer indices
s[1:4]

In [None]:
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc[5:15]

In [None]:
# The iloc[] method considers the slice object as positional index and not as the actual indices
# The `stop` argument is NOT inclusive for `s.iloc[]` being position based
s.iloc[1:4]

**Selection/Filtering/Subsetting of Series object having String Indices**

In [None]:
list1 = ['Ehtisham', 'Ali', 'Ayesha', 'Dua','Adeen']

indices = ['num1', 'num2', 'num3', 'num4', 'num5']
s = pd.Series(data=list1, index=indices)
s

In [None]:
s[0:2]

In [None]:
# The subscript operator treats an integer slice as positions, but a slice of strings as index labels
# The `stop` argument is inclusive for `s[]` when slicing by string labels
s['num2':'num4']

In [None]:
# An integer slice in `s[]` is positional even with a string index, so `stop` is NOT inclusive
s[0:2]

In [None]:
#The loc[] method considers the slice object as actual indices and not as positional indices
# The stop argument is inclusive for `s.loc[]` for both integer and label indices
s.loc['num2':'num4']

In [None]:
# The iloc[] method considers the slice object as positional index and not as the actual indices
# iloc method is position based, so will flag an error if you pass it string indices
#s.iloc['num2': 'num4'] 
# however will work fine if you pass an integer values (specifying positions) in the slice operator
# Moreover the stop index is not inclusive
s.iloc[1:4]

**Understanding Step with Series object having String Indices**

In [None]:
s

In [None]:
# The step works fine with string indices as well
s['num2':'num5':1]

In [None]:
s['num2':'num5':2]

In [None]:
s['num5':'num3':-1]

<img align="right" width="300" height="300"  src="images/series-anatomy.png"  >

### d. Third use of Index (Alignment)
- We can perform basic arithmetic operations like addition, subtraction, multiplication, division, etc., on two Series objects, to produce a new Series instance.
- The operation is done on each corresponding pair of elements. This is done by matching the indices of the two series objects.

**Example 1:** Adding two series object with same integer indices

In [None]:
list1 = [1,3,5,7,9]
list2 = [2,4,6,8,10]
s1 = pd.Series(data=list1)
s2 = pd.Series(data=list2)   # build s2 from list2, not list1

In [None]:
print(s1)
print(s1.index)

In [None]:
print(s2)
print(s2.index)

In [None]:
s3 = s1 + s2
print(s3)
print(s3.index)

**Example 2:** Adding two series object having different integer indices

In [None]:
list1 = [6,9,7,5]
index1 = [0,1,2,3]
list2 = [8,6,2,1]
index2 = [0,2,3,5]
s1 = pd.Series(data=list1, index=index1);
s2 = pd.Series(data=list2, index=index2);

In [None]:
s1,s2

In [None]:
print(s1)
print(s1.index)

In [None]:
print(s2)
print(s2.index)

In [None]:
s3 = s1 + s2
print(s3)
print(s3.index)

**Problem:** When performing mathematical operations on series with mismatched indices, every index label that is missing from one of the series produces NaN in the result by default.

**Solution:** To handle this problem, instead of using the operators (`+, -, *, /`), an explicit call to `s.add()`, `s.sub()`, `s.mul()` or `s.div()` is preferred. These methods accept a `fill_value` that replaces the missing values in either series, so the output holds a concrete value in place of NaN

In [None]:
s1.add(s2, fill_value=0) # Compare it with above result
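
The sketch below repeats Example 2 side by side so the effect of `fill_value` is easier to compare; it also shows that a neutral value of 1 is the natural choice for `mul()`. The names `plain`, `filled` and `product` are illustrative.

```python
import pandas as pd

s1 = pd.Series([6, 9, 7, 5], index=[0, 1, 2, 3])
s2 = pd.Series([8, 6, 2, 1], index=[0, 2, 3, 5])

plain = s1 + s2                     # labels 1 and 5 exist in only one series -> NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0
product = s1.mul(s2, fill_value=1)  # the missing side is treated as 1

print(plain)
print(filled)
print(product)
```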

**Example 3:** Adding two series object having different string indices

In [None]:
list1 = [6,9,7,5, 2]
labels1 = ['num1', 'num2', 'num3', 'num4', 'num5']

list2 = [8,6,2,3,6]
labels2 = ['num1', 'num2', 'num3', 'num8', 'num5']

s1 = pd.Series(data=list1, index=labels1)
s2 = pd.Series(data=list2, index=labels2)


In [None]:
print(s1)
print(s1.index)

In [None]:
print(s2)
print(s2.index)

In [None]:
# Let us use the `add()` method
#s1+s2
s3 = s1.add(s2, fill_value=5)
#s3 = s1.add(s2)
print(s3)
print(s3.index)

In [None]:
import pandas as pd
import numpy as np

In [None]:
a = np.arange(1,20,2)
a

In [None]:
s = pd.Series(a)
s

In [None]:
s.shape

In [None]:
# s[s>10]
# s[s%3==0]

In [None]:
# s[[3,5]]
s.loc[3:6]

In [None]:
s.iloc[3:6]

**My dear fellows, please make time to practice following topics related to Series:**
- Boolean/Fancy Indexing and Slicing
- Use of `reset_index()` method for completely resetting the index
- Use of other manipulation methods like 
    - `s.pop(index)` is passed an index label; it returns the data item at that label and removes it from the series
    - `s.drop(indexes)` is passed one index label or a list of them and returns a new series without those items. The original series remains unchanged unless the `inplace=True` argument is passed
    - `s1.append(s2, ignore_index=False, verify_integrity=False)` concatenates two series and returns the result, leaving the original series unchanged. It is deprecated since pandas 1.4 and removed in pandas 2.0; `pd.concat([s1, s2])` is the replacement
    - `s1.update(s2)` modifies the series `s1` in place using the values from the passed series
>**We will discuss these while studying Pandas Dataframe object InshaAllah**
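
As a starting point for this practice, here is a small sketch of the methods listed above on made-up data. It uses `pd.concat()` for concatenation so that it also runs on pandas versions where `append()` is no longer available.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([40, 50], index=['d', 'e'])

combined = pd.concat([s1, s2])                  # new series; s1 and s2 are unchanged
popped = combined.pop('e')                      # returns the item and removes it in place
dropped = combined.drop(['a', 'b'])             # new series without 'a' and 'b'
combined.update(pd.Series([99], index=['c']))   # overwrites the value at 'c' in place
reset = combined.reset_index(drop=True)         # replaces the labels with 0, 1, 2, ...

print(popped)
print(dropped)
print(combined)
print(reset)
```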

In [None]:
s1.reset_index()

In [None]:
s

In [None]:
s.pop(3)

In [None]:
s.drop([2,4,6], inplace=True)

In [None]:
s

# Pandas Series vs NumPy 1-D Arrays
>- In a series object we can define our own labeled index to access elements. The labels can be numbers or strings. NumPy arrays are accessed by integer position only.
>- In a series object the index labels can be in any order, including descending. In NumPy arrays, positions start at zero for the first element and are fixed.
>- Arithmetic on series is aligned on index labels, so misaligned indices may generate NaN (missing) values. NumPy arrays have no labels to align on: operations are element-wise by position, broadcasting applies, and arithmetic on arrays with incompatible shapes fails.
>- A series requires more memory than a NumPy array holding the same data, because it also stores the index.
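
A minimal sketch of the third point, on made-up data: adding two series of different lengths aligns on labels and yields NaN where a label is missing, while adding two NumPy arrays of incompatible shapes raises a `ValueError`.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])
aligned = a + b          # 'x' has no partner in b, so the result there is NaN
print(aligned)

# NumPy has no labels: shapes (3,) and (2,) cannot be broadcast together
try:
    np.array([1, 2, 3]) + np.array([10, 20])
    numpy_error = None
except ValueError as exc:
    numpy_error = str(exc)
print(numpy_error)
```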
    
    

## Check Your Concepts:
-  What is Series in Pandas?
- Create a Pandas Series from array 
- Creating a Pandas Series from Dictionary 
- Creating a Pandas Series from Lists 
- Create Pandas Series using NumPy functions 
- Access the elements of a Series in Pandas 

## Practice Questions:
- Write a Pandas program to convert a Pandas Series to a Python list and show its type.
- Write a Pandas program to add, subtract, multiply and divide two Pandas Series having the same indices.
- Write a Pandas program to compare the elements of two Pandas Series. (Hint: `Series.eq()` / `Series.equals()`)
- Write a Pandas program to change the data type of a given column or Series.
- Write a Pandas program to convert a given Series to an array. (Hint: `series.values` or `series.to_numpy()`)
- Write a Pandas program to sort a given Series.
- Write a Pandas program to add some data to an existing Series. (Hint: `series.append()`, or `pd.concat()` in pandas 2.0 and later)
- Write a Pandas program to compute the mean and standard deviation of the data of a given Series.
- Write a Pandas program to get the items of a given series not present in another given series. (Hint: `series.isin()`)

In [None]:
# series into list
x = list(s)
x, type(x)

In [None]:
s1 = pd.Series(np.arange(1,11))
s2 = pd.Series(np.arange(1,20,2))
print(s1)
print(s2)

In [None]:
s1+s2 # same as s1.add(s2)

In [None]:
s2-s1

In [None]:
s2*s1

In [None]:
s2/s1

In [None]:
# compare elements of two series
s1.eq(s2)

In [None]:
# change datatype of any series
s1.astype(float)

In [None]:
# series into a Python list (via the underlying NumPy array)
s1.values.tolist()

In [None]:
# series into a NumPy array
np.array(s1.values.tolist())

In [None]:
# s1.sort_index(ascending=False)
s1.sort_values(ascending=False)

In [None]:
# Series.append() was removed in pandas 2.0; pd.concat() works in old and new versions
pd.concat([s2, pd.Series([21])], ignore_index=True)

In [None]:
print("Mean of series : ",s1.mean())
print("Std of Series : ", s1.std())

In [None]:
s1,s2

In [None]:
# elements of s1 that are also present in s2
s1[s1.isin(s2)]

In [None]:
# elements of s1 that are not present in s2
s1[~s1.isin(s2)]

### Write a Pandas program to convert Series of lists to one Series.(Hint : series.apply())
```
Sample Output:

Original Series of list
0    [Red, Green, White]
1           [Red, Black]
2               [Yellow]
dtype: object
One Series
0       Red
1     Green
2     White
3       Red
4     Black
5    Yellow
dtype: object                      
```

In [None]:
list1 = [['Red', 'Green', 'White'],['Red','Black'],['Yellow']]
s3 = pd.Series(list1)
s3

In [None]:
def printlist(s3):
    list1 = []
    for i in s3:
        for j in i :
            list1.append(j)
    return list1
printlist(s3)

In [None]:
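# `s3.apply(printlist)` would run the function on each inner list separately,
# so build the flattened Series from the function's result instead
pd.Series(printlist(s3))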
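
As an alternative to the explicit loops, the sketch below uses `Series.explode()`, which puts every element of each inner list on its own row, followed by `reset_index(drop=True)` to renumber the rows. This is a different technique from the `apply()` hint in the exercise; the name `one_series` is illustrative.

```python
import pandas as pd

s3 = pd.Series([['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']])

# explode() repeats the original label for each list element, so renumber afterwards
one_series = s3.explode().reset_index(drop=True)
print(one_series)
```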

In [None]:
s1.sort_values()

In [None]:
# list(s1)
type(s1)
# s1.tolist()

# Pandas - Assignment no 02
- Here is link to [Pandas -Assignment no 02]()

## Google Play Store Apps Exploratory Data Analysis
#### Introduction
Google Play Store, formerly Android Market, is a digital distribution service developed and operated by Google. It is the official app store and provides a variety of content such as apps, books, magazines, music, movies and television programs. It serves as a platform that allows users with 'Google certified' Android devices to download applications developed and published on the platform, either for a charge or free of cost. With the rapid growth of Android devices and apps, it would be interesting to analyse the data to obtain valuable insights.

The dataset used here is `Google Play Store Apps` from Kaggle. It contains about 10k web-scraped Play Store app records for analysing the Android market. The tools used for this `EDA` are `numpy`, `pandas`, `matplotlib` and `seaborn`.

### Step 1 : Data Preparation and Data Cleaning
Here, we load the Google Play Store Apps data, stored as CSV, using pandas, a fast and powerful Python library for data analysis and manipulation built around the DataFrame object. It is commonly used for working with tabular data (e.g. data in a spreadsheet) in formats such as CSV, Excel spreadsheets, HTML tables and JSON. We then perform some data preparation and cleaning on it. The dataset can be downloaded from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/googleplaystore.csv)

In [None]:
url = "https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/googleplaystore.csv"


In [None]:
# import the necessary libraries
# allow matplotlib to plot inline with frontends like Jupyter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load the apps and reviews data into pandas dataframe
df = pd.read_csv(url)

In [None]:
# look at the first 10 records in the apps dataframe
df.head(10)

In [None]:
df.columns

In [None]:
# look at the random 10 records in the apps dataframe
df.sample(10)

#### Description of App Dataset columns
- App : The name of the app
- Category : The category of the app
- Rating : The rating of the app in the Play Store
- Reviews : The number of reviews of the app
- Size : The size of the app
- Installs : The number of installs of the app
- Type : The type of the app (Free/Paid)
- Price : The price of the app (0 if it is Free)
- Content Rating : The appropriate target audience of the app
- Genres: The genre of the app
- Last Updated : The date when the app was last updated
- Current Ver : The current version of the app
- Android Ver : The minimum Android version required to run the app

In [None]:
# look at the info of the dataframe
df.info()

#### By diagnosing the data frame, we know that:
- There are 13 columns of properties with 10841 rows of data.
- Columns 'Reviews', 'Size', 'Installs' and 'Price' are of type 'object'.
- Values of column 'Size' are strings representing size in 'M' as Megabytes, 'k' as kilobytes and also 'Varies with device'.
- Values of column 'Installs' are strings representing install amount with symbols such as ',' and '+'.
- Values of column 'Price' are strings representing price with symbol '$'.
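
The string formats described above can also be cleaned with vectorised string methods. The sketch below runs on a few hypothetical rows that mimic those formats rather than on the real dataset, and it maps 'Varies with device' to NaN, which is a design choice of this sketch. The helper `size_to_mb` is a made-up name.

```python
import pandas as pd

# Hypothetical rows that mimic the string formats described above
sample = pd.DataFrame({
    'Size': ['19M', '512k', 'Varies with device'],
    'Installs': ['10,000+', '5,000,000+', '100+'],
    'Price': ['0', '$4.99', '0'],
})

def size_to_mb(text):
    """Convert '19M' or '512k' to megabytes; anything else becomes NaN."""
    if text.endswith('M'):
        return float(text[:-1])
    if text.endswith('k'):
        return float(text[:-1]) / 1024   # 1 Megabyte = 1024 kilobytes
    return float('nan')

sample['Size'] = sample['Size'].apply(size_to_mb)
# Remove ',' and '+' in one pass, then convert to integers
sample['Installs'] = sample['Installs'].str.replace('[,+]', '', regex=True).astype(int)
# '$' is a regex metacharacter, so treat it as a literal string
sample['Price'] = sample['Price'].str.replace('$', '', regex=False).astype(float)

print(sample)
print(sample.dtypes)
```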

In [None]:
# type of Category
# df.columns
# df.Category.unique()
# df.Category.nunique()

In [None]:
# type of Type
# df.columns
# df.Type
# df.Type.unique()

In [None]:
# type of Content Rating
# df.columns
# df['Content Rating'].unique()

In [None]:
# type of Genres
# df.Genres.nunique()

### Data Cleaning
- Clean the 'Reviews' data and change the type 'object' to 'float'.(Hint : astype() and use explicit function)

In [None]:
# def reviews_clean(reviews_list):
#     cleaned_list = []
#     for review in reviews_list:
#         if 'M' in review:
#             review = review.replace('M','')
#             review = float(review)*1000000 # 1M = 1000000
#         cleaned_list.append(review)
#     return cleaned_list
# list1 = [i for i in df.Reviews]   # define the function before calling it
# df['Reviews'] = reviews_clean(list1)
# df['Reviews'] = df.Reviews.astype(float)

- Clean the 'Size' data and change the type 'object' to 'float'. One record has the value '1,000+'; remove it from the dataframe, as it is uncertain whether it is in 'M' or 'k'.

In [None]:
# index = df[df.Size == '1,000+'].index
# df.drop(axis=0, inplace=True, index=index)

In [None]:
# sizes = [i for i in df.Size]
# def cleaned_size(size):
#     cleaned_list = []
#     for i in size:
#         if 'M' in i:
#             i = i.replace('M','')
#             i = float(i)
#         elif 'k' in i:
#             i = i.replace('k','')
#             i = float(i)
#             i = i/1024  # 1 Megabyte = 1024 kilobytes
#         elif "Varies with device" in i :
#             i = float(0)
#         cleaned_list.append(i)
#     return cleaned_list
# df['Size'] = cleaned_size(sizes)


In [None]:
# df['Size'] = df['Size'].astype(float)

- Clean the 'Installs' data and change the type 'object' to 'int'.

In [None]:
# installs =[i for i in df.Installs]
# installs

In [None]:
def cleaned_installs(data):
    cleaned_list = []
    for install in data:
        if ',' in install:
            install = install.replace(',','')
        if '+' in install:
            install = install.replace('+','')
        cleaned_list.append(int(install))
    return cleaned_list

# Pass the column directly, since the `installs` list is only defined in a commented-out cell
df.Installs = cleaned_installs(df.Installs)
df.Installs = df.Installs.astype('int')

- Clean the 'Price' data and change the type 'object' to 'float'.

In [None]:
prices = [i for i in df.Price]
# prices
def cleaned_Prices(data):
    cleaned_list = []
    for price in data:
        if '$' in price:
            price = price.replace('$','')
        cleaned_list.append(price)
    return cleaned_list
# cleaned_Prices(prices)

In [None]:
df.Price = cleaned_Prices(prices)
df.Price = df.Price.astype(float)
df.Price

In [None]:
# look at the random 10 records in the apps dataframe to verify the cleaned columns
df.sample(10)

In [None]:
# check on null values
df.isna().sum()

Here, we see that there are 1474 rows with null values in the `Rating` column. We replace these null values with the median of the overall `Rating` values.

In [None]:
# df.Rating
# df.Rating.median()
# df['Rating'] = df.Rating.fillna(df.Rating.median())
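
To make the median replacement concrete, here is a sketch on a small made-up series of ratings instead of the real column. `median()` skips NaN values by default, and `fillna()` returns a new series with the gaps filled.

```python
import numpy as np
import pandas as pd

rating = pd.Series([4.1, np.nan, 3.9, 4.7, np.nan])   # hypothetical ratings with two gaps

median = rating.median()          # computed from the three non-missing values
filled = rating.fillna(median)    # new series; `rating` itself is unchanged

print(median)
print(filled)
```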

In [None]:
df.isna().sum()

In [None]:
# remove the record where 'Type' is having null value

In [None]:
index = df[df.Type.isna()].index
df.drop(axis=0, inplace=True, index=index)

In [None]:
df.isna().sum()

In [None]:
# check on statistical information of the dataframe
df.describe()

In [None]:
# sort the dataframe by Reviews descendingly
df.sort_values(by='Reviews',ascending=False,  inplace=True)

In [None]:
df.head()

In [None]:
# drop duplicate rows based on App 
df.drop_duplicates(subset=['App'], inplace=True)
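
Sorting by `Reviews` before dropping duplicates matters because `drop_duplicates()` keeps the first occurrence by default. The sketch below shows this on a tiny made-up dataframe: the duplicate with the most reviews is the one that survives.

```python
import pandas as pd

# Hypothetical apps; 'Chat' appears twice with different review counts
apps = pd.DataFrame({
    'App': ['Chat', 'Chat', 'Notes'],
    'Reviews': [120.0, 950.0, 40.0],
})

apps = apps.sort_values(by='Reviews', ascending=False)
deduped = apps.drop_duplicates(subset=['App'])   # keep='first' is the default
print(deduped)
```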

In [None]:
df.head()

### Exploratory Analysis and Visualization

In [None]:
# get the number of apps for each category
# first of all we get all the categories
df.Category.unique()

# second, we get count of each category
df.Category.value_counts()

### Bar Plot

In [None]:
from collections import Counter
counts = Counter(df.Category)
counts = dict(counts)
counts

In [None]:
# labels = list(counts.keys())
# values = list(counts.values())
# # labels,values

In [None]:
# fig = plt.figure(figsize=(10,10))
# ax = fig.add_subplot() 
# ax.bar(x=labels, height=values)
# ax.set_xlabel("Category")
# ax.set_ylabel("Number of Apps")
# plt.title("Number of Apps per Category")
# plt.show()

In [None]:
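# labels and values come from the `counts` dictionary built above
labels = list(counts.keys())
values = list(counts.values())

sns.set_style('dark')  # set the style before creating the figure so it takes effect
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot()
ax.barh(y=labels, width=values)
ax.set_xlabel("Number of Apps")
ax.set_ylabel("Category")
plt.title("Number of Apps per Category")
plt.show()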

### Countplot

In [None]:
# plt.figure(figsize=(10,5))
# sns.set_style('darkgrid')
# sns.countplot(x = df.Category, data=df)
# plt.title("Number of Apps per Category")
# plt.ylabel("Number of Apps")
# plt.xticks(rotation=90)
# plt.show()

**From this plot we can see that most of the apps in the Play Store belong to the `Family`, `Game` and `Tools` categories.**

In [None]:
# get the number of installs for each category
# Firstly, we groupby the data according to category and installs
df.columns
categories = df.groupby(['Category'])
new_df = categories['Installs'].sum().reset_index()
new_df

In [None]:
# bar plot using matplotlib
fig = plt.figure(figsize=(10,5))
ax = fig.add_subplot()
ax.bar(x=new_df.Category, height=new_df.Installs)
ax.set_xlabel("Category")
ax.set_ylabel("Installs (e+10)")
ax.set_title("Number of Installs for each Category")
plt.xticks(rotation=90)
plt.show()

In [None]:
# barplot using seaborn
plt.figure(figsize=(10,5))
sns.barplot(x= new_df.Category, y=new_df.Installs)
plt.title("Number of Installs for each Category")
plt.xticks(rotation=90)
plt.show()

From this plot of the number of installs for each category, we can see that most of the apps being downloaded and installed are from the `Game` and `Communication` categories.

In [None]:
# show the distribution of rating using matplotlib and seaborn

In [None]:
plt.figure(figsize=(10,5))
plt.hist(df.Rating, bins=30)
plt.xlabel("Rating")
plt.ylabel("Number of Apps")
plt.show()

In [None]:
# using seaborn
plt.figure(figsize=(10, 5))
sns.histplot(df.Rating)
plt.title('Rating Distribution')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')
plt.show()

In [None]:
# using seaborn
plt.figure(figsize=(10, 5))
sns.countplot(x = 'Rating', data=df)
plt.title('Rating Distribution')
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')
plt.show()

In [None]:
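# plot the line graphs of reviews, size, installs and price per rating
# select the cleaned numeric columns explicitly so that text columns are not summed
new_df1 = df.groupby('Rating')[['Reviews', 'Size', 'Installs', 'Price']].sum().reset_index()
new_df1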

In [None]:
new_df1.corr()

In [None]:
sns.heatmap(new_df1.corr())

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(14, 4))

In [None]:
new_df1.columns

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(14, 4))
axes[0].plot('Rating','Reviews', data=new_df1)
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Reviews')
axes[0].set_title('Reviews Per Rating')


axes[1].plot('Rating','Size', data=new_df1, color='g')
axes[1].set_xlabel('Rating')
axes[1].set_ylabel('Size')
axes[1].set_title('Size Per Rating')

axes[2].plot('Rating','Installs', data=new_df1, color='r')
axes[2].set_xlabel('Rating')
axes[2].set_ylabel('Installs')
axes[2].set_title('Installs Per Rating')

axes[3].plot('Rating','Price', data=new_df1, color='k')
axes[3].set_xlabel('Rating')
axes[3].set_ylabel('Price')
axes[3].set_title('Price Per Rating')
plt.tight_layout(pad=2)

plt.show()

**From the above plots, we can infer that apps in the higher rating range of `4.0 - 4.7` account for a high number of reviews, a large total size, and many installs. Price does not show a direct relationship with rating, as it fluctuates even within the high rating range.**

In [None]:
# distribution of application type (Free or Paid) using matplotlib and seaborn
type_df = df.Type.value_counts().reset_index()
type_df

In [None]:
# using matplotlib
plt.figure(figsize=(7,8))
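# take labels and counts straight from value_counts(); the column names produced by
# value_counts().reset_index() differ between pandas versions
type_counts = df.Type.value_counts()
plt.pie(x=type_counts.values, labels=type_counts.index, autopct='%.2f%%', shadow=True)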
plt.legend()
plt.show()

In [None]:
# using seaborn
plt.figure(figsize=(7,5))
sns.countplot(x='Type', data=df)
plt.title('Type Distribution')
plt.ylabel('Number of Apps')
plt.show()

**From the plot we can infer that the majority of the apps in the Play Store are Free apps.**

## Asking and Answering Questions
#### 1. What are the top 5 apps on the basis of installs?
#### 2. What are the top 5 most reviewed apps?
#### 3. What are the top 5 most expensive apps?
#### 4. What are the top 3 most installed apps in the Game category?
#### 5. Which 5 apps from the 'FAMILY' category have the lowest rating?
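
One possible way to approach these questions is with `nlargest()` and `nsmallest()`, optionally after filtering by category. The sketch below uses a tiny made-up dataframe with smaller counts, so it only illustrates the pattern; on the cleaned `df` you would use 5 or 3 as the first argument.

```python
import pandas as pd

# Hypothetical data standing in for the cleaned Play Store dataframe
apps = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Category': ['GAME', 'FAMILY', 'GAME', 'FAMILY'],
    'Installs': [500, 100, 900, 300],
    'Rating': [4.5, 3.1, 4.0, 2.2],
})

top_installs = apps.nlargest(2, 'Installs')                                # most installed overall
top_games = apps[apps.Category == 'GAME'].nlargest(1, 'Installs')          # most installed game
lowest_family = apps[apps.Category == 'FAMILY'].nsmallest(1, 'Rating')     # lowest rated family app

print(top_installs)
print(top_games)
print(lowest_family)
```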