# Pandas Deep-Dive



#### 1. Import the Pandas package under the alias pd and NumPy under the alias np.

In [109]:
import pandas as pd
import numpy as np 


#### 2. Use the `apple_store.sql` file to create a new database in MySQL Workbench with that data. Once loaded, try to answer the following questions using MySQL Workbench:

 - How many apps are there in the data source?
 - What is the average rating of all apps?
 - How many apps have an average rating no less than 4?
 - How many genres are there in total for all the apps?
 - Which genre is most likely to contain free apps?

In [None]:
# use new_schema;
# select * from apple_store;
# select count(distinct track_name) as distinct_track_count
# from apple_store;

# -- overall averages across all apps (no GROUP BY, which would return one row per app)
# select avg(user_rating), avg(user_rating_ver) from apple_store;

# SELECT COUNT(DISTINCT track_name) 
# FROM apple_store 
# WHERE user_rating >= 4;

# select count(distinct prime_genre) from apple_store;

# SELECT prime_genre, COUNT(track_name) AS number_apps
# FROM apple_store
# WHERE price = 0
# GROUP BY prime_genre
# ORDER BY number_apps DESC;





# How many apps are there in the data source?
# 7195 apps


# What is the average rating of all apps?

# user rating: 3.52
# user rating_ver: 3.25


# How many apps have an average rating no less than 4?
# 4781 apps


# How many genres are there in total for all the apps?
# 23 genres


# Which genre is most likely to contain free apps?
# Games (2267 apps)



#### 3. Create a SQL connection in this notebook and load the `apple_store` dataset. Assign it to a variable called `data`, which would be a pandas dataframe.

In [110]:
from sqlalchemy import create_engine
import getpass
saved_password = getpass.getpass()

  


········


In [111]:
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root",
                               pw=saved_password,
                               db="apple_store"))

In [112]:

query = "SELECT * FROM Publicationss.apple_store;"
data = pd.read_sql_query(query, engine)

print(data)



              id                                         track_name  \
0     1188375727                       Escape the Sweet Shop Series   
1      281656475                                    PAC-MAN Premium   
2      281796108                          Evernote - stay organized   
3      281940292  "WeatherBug - Local Weather, Radar, Maps, Alerts"   
4      282614216  "eBay: Best App to Buy, Sell, Save! Online Sho...   
...          ...                                                ...   
7206  1187617475                                              Kubik   
7207  1187682390                                  VR Roller-Coaster   
7208  1187838770          VR Roller Coaster World - Virtual Reality   
7209  1187838770          VR Roller Coaster World - Virtual Reality   
7210  1188375727                       Escape the Sweet Shop Series   

     size_bytes price rating_count_tot rating_count_ver user_rating  \
0      90898432     0                3                3           5   
1    
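The same loading pattern can be tried without a MySQL server. The sketch below is a stand-in built on assumptions: it uses Python's built-in `sqlite3` module instead of MySQL, and a tiny made-up `apple_store` table (the three rows are illustrative, not the real dataset). It shows `pd.read_sql_query` returning a dataframe, both for a plain `SELECT *` and for an aggregate query like the ones in step 2.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apple_store (id INTEGER, track_name TEXT, price REAL, prime_genre TEXT)"
)
conn.executemany(
    "INSERT INTO apple_store VALUES (?, ?, ?, ?)",
    [
        (1, "App A", 0.0, "Games"),      # made-up rows for illustration only
        (2, "App B", 3.99, "Games"),
        (3, "App C", 0.0, "Weather"),
    ],
)
conn.commit()

# Load the whole table into a dataframe.
demo = pd.read_sql_query("SELECT * FROM apple_store;", conn)

# Run an aggregate query and get the result back as a dataframe.
free_by_genre = pd.read_sql_query(
    "SELECT prime_genre, COUNT(*) AS number_apps FROM apple_store "
    "WHERE price = 0 GROUP BY prime_genre ORDER BY number_apps DESC, prime_genre;",
    conn,
)
conn.close()

print(demo.shape)
print(free_by_genre)
```

With SQLAlchemy and MySQL the call is the same; only the connection object passed as the second argument changes.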

#### 4. Print the first 5 rows of `data` to see what the data look like.

A data analyst usually does this to have a general understanding about what the data look like before digging deep.

In [30]:
data.head()

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,prime_genre
0,1188375727,Escape the Sweet Shop Series,90898432,0.0,3,3,5.0,5.0,Games
1,281656475,PAC-MAN Premium,100788224,3.99,21292,26,4.0,4.5,Games
2,281796108,Evernote - stay organized,158578688,0.0,161065,26,4.0,3.5,Productivity
3,281940292,"""WeatherBug - Local Weather, Radar, Maps, Alerts""",100524032,0.0,188583,2822,3.5,4.5,Weather
4,282614216,"""eBay: Best App to Buy, Sell, Save! Online Sho...",128512000,0.0,262241,649,4.0,4.5,Shopping


#### 5. Print the summary of the data.

In [36]:

data.shape

(7211, 9)

In [37]:
data.describe()

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,prime_genre
count,7211,7211,7211,7211,7211,7211,7211.0,7211.0,7211
unique,7197,7195,7107,36,3185,1138,10.0,10.0,23
top,1188375727,Escape the Sweet Shop Series,90898432,0,0,0,4.5,4.5,Games
freq,4,4,4,4066,931,1447,2668.0,2210.0,3876


#### 6. Print the number of columns in the data.

In [39]:

len(data.columns)

9

#### 7. Print all column names.

In [40]:
 
data.columns

Index(['id', 'track_name', 'size_bytes', 'price', 'rating_count_tot',
       'rating_count_ver', 'user_rating', 'user_rating_ver', 'prime_genre'],
      dtype='object')

#### 8. Now that we have a general understanding of the data, we'll start working on the challenge questions. How many apps are there in the data source? Print the number of observations of the data.

**Hint**: Your code should return the number 7197.

In [54]:
# Count distinct app ids; the table has 7211 rows, some of which are duplicates.
apps = data["id"].nunique()
print("Number of apps:", apps)

Number of apps: 7197
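The row count (7211) and the number of distinct ids (7197) differ because some rows appear more than once. A minimal sketch of the distinction, using a made-up three-row frame named `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "id": [10, 20, 10],
    "track_name": ["App A", "App B", "App A"],
})

n_rows = len(toy)                    # counts every row, duplicates included
n_unique_ids = toy["id"].nunique()   # counts distinct ids only
deduped = toy.drop_duplicates()      # drops rows that repeat an earlier row exactly

print(n_rows, n_unique_ids, len(deduped))
```

Whether to deduplicate the real `data` before the later steps is a judgment call; several of the hints in this notebook appear to assume a table without the repeated rows.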


#### 9. What is the average rating of all apps? 

First, read the `user_rating` column into a variable named `user_rating`.

In [57]:
user_rating = data['user_rating']
user_rating

0         5
1         4
2         4
3       3.5
4         4
       ... 
7206    4.5
7207    4.5
7208    4.5
7209    4.5
7210      5
Name: user_rating, Length: 7211, dtype: object

Now you can calculate the average of the `user_rating` data.

**Hint**: Your code should return 3.526955675976101.

In [63]:
# Ratings were loaded as strings (dtype: object), so convert them to numbers first.
data['user_rating'] = pd.to_numeric(data['user_rating'], errors='coerce')
avg_user_rating = data['user_rating'].mean()
print("Average user rating:", avg_user_rating)



Average user rating: 3.5271113576480375


#### 10. How many apps have an average rating no less than 4?

First, filter `user_rating` where its value >= 4. 

Assign the filtered dataframe to a new variable called `user_rating_high`.

In [64]:
user_rating_high = data[data['user_rating']>=4]
user_rating_high

Unnamed: 0,id,track_name,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,prime_genre
0,1188375727,Escape the Sweet Shop Series,90898432,0,3,3,5.0,5,Games
1,281656475,PAC-MAN Premium,100788224,3.99,21292,26,4.0,4.5,Games
2,281796108,Evernote - stay organized,158578688,0,161065,26,4.0,3.5,Productivity
4,282614216,"""eBay: Best App to Buy, Sell, Save! Online Sho...",128512000,0,262241,649,4.0,4.5,Shopping
5,282935706,Bible,92774400,0,985920,5320,4.5,5,Reference
...,...,...,...,...,...,...,...,...,...
7206,1187617475,Kubik,126644224,0,142,75,4.5,4.5,Games
7207,1187682390,VR Roller-Coaster,120760320,0,30,30,4.5,4.5,Games
7208,1187838770,VR Roller Coaster World - Virtual Reality,97235968,0,85,32,4.5,4.5,Games
7209,1187838770,VR Roller Coaster World - Virtual Reality,97235968,0,85,32,4.5,4.5,Games


Now obtain the length of `user_rating_high` which should return 4781.

In [71]:

# Count distinct app ids, since the table contains duplicate rows.
count = user_rating_high['id'].nunique()
print("Number of unique apps in user_rating_high:", count)

Number of unique apps in user_rating_high: 4781


#### 11. How many genres are there in total for all the apps?

Define a new variable named `genres` that contains the `prime_genre` column of `data`. Google for how to obtain unique values of a dataframe column. 

In [73]:
genres = data['prime_genre'].unique()
genres

array(['Games', 'Productivity', 'Weather', 'Shopping', 'Reference',
       'Finance', 'Music', 'Utilities', 'Travel', 'Social Networking',
       'Sports', 'Business', 'Health & Fitness', 'Entertainment',
       'Photo & Video', 'Navigation', 'Education', 'Lifestyle',
       'Food & Drink', 'News', 'Book', 'Medical', 'Catalogs'],
      dtype=object)

Print the length of the unique values of `genres`. Your code should return 23.

In [76]:
unique_genres = data['prime_genre'].nunique()
print('length of the unique values of genres:', unique_genres)

length of the unique values of genres: 23


#### 12. What are the top 3 genres with the most apps?

What you want to do is count the number of occurrences of each unique genre value. Because you already know how to obtain the unique genre values, you could of course count the number of apps in each genre one by one. However, Pandas has a convenient function that counts all values of a dataframe column with a single command. Google for "pandas count values" to find the solution. Your code should return the following:

```
Games            3862
Entertainment     535
Education         453
Name: prime_genre, dtype: int64
```

In [83]:
genre_values = data['prime_genre']
genre_values.value_counts().head(3)

prime_genre
Games            3876
Entertainment     535
Education         453
Name: count, dtype: int64

#### 13. Which genre is most likely to contain free apps?

First, filter `data` where the price is 0.00. Assign the filtered data to a new variable called `free_apps`. Then count the values in `free_apps`. Your code should return:

```
Games                2257
Entertainment         334
Photo & Video         167
Social Networking     143
Education             132
Shopping              121
Utilities             109
Lifestyle              94
Finance                84
Sports                 79
Health & Fitness       76
Music                  67
Book                   66
Productivity           62
News                   58
Travel                 56
Food & Drink           43
Weather                31
Navigation             20
Reference              20
Business               20
Catalogs                9
Medical                 8
Name: prime_genre, dtype: int64
```

In [119]:
data['price'] = pd.to_numeric(data['price'], errors='coerce')

free_apps = data[data['price'] == 0]

genre_counts = free_apps['prime_genre'].value_counts()
genre_counts

prime_genre
Games                2267
Entertainment         334
Photo & Video         167
Social Networking     143
Education             132
Shopping              121
Utilities             109
Lifestyle              94
Finance                84
Sports                 79
Health & Fitness       76
Music                  67
Book                   66
Productivity           62
News                   58
Travel                 56
Food & Drink           43
Weather                31
Business               20
Reference              20
Navigation             20
Catalogs                9
Medical                 8
Name: count, dtype: int64

#### 14. Now you can calculate the proportion of the free apps in each genre based on the value counts you obtained in the previous two steps. 

Challenge yourself by achieving that with one line of code. The output should look like:

```
Shopping             0.991803
Catalogs             0.900000
Social Networking    0.856287
Finance              0.807692
News                 0.773333
Sports               0.692982
Travel               0.691358
Food & Drink         0.682540
Lifestyle            0.652778
Entertainment        0.624299
Book                 0.589286
Games                0.584412
Music                0.485507
Photo & Video        0.478510
Utilities            0.439516
Navigation           0.434783
Weather              0.430556
Health & Fitness     0.422222
Business             0.350877
Productivity         0.348315
Medical              0.347826
Reference            0.312500
Education            0.291391
Name: prime_genre, dtype: float64
```

The numbers are interesting, aren't they?

In [123]:
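# Note: this divides by the total number of free apps, so it gives each genre's
# share of all free apps, not the fraction of apps within each genre that are free.
proportion_free_apps = (genre_counts / genre_counts.sum()).sort_values(ascending=False)
print(proportion_free_apps)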


prime_genre
Games                0.557550
Entertainment        0.082145
Photo & Video        0.041072
Social Networking    0.035170
Education            0.032464
Shopping             0.029759
Utilities            0.026808
Lifestyle            0.023119
Finance              0.020659
Sports               0.019429
Health & Fitness     0.018692
Music                0.016478
Book                 0.016232
Productivity         0.015248
News                 0.014265
Travel               0.013773
Food & Drink         0.010576
Weather              0.007624
Business             0.004919
Reference            0.004919
Navigation           0.004919
Catalogs             0.002213
Medical              0.001968
Name: count, dtype: float64
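The expected output for this step is the fraction of apps within each genre that are free, that is, the free-app count per genre divided by the total app count per genre. A hedged sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data: 3 of 4 Games apps are free, 1 of 2 Weather apps is free.
toy = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Games", "Games", "Weather", "Weather"],
    "price": [0.0, 0.0, 0.0, 1.99, 0.0, 2.99],
})

free_counts = toy[toy["price"] == 0]["prime_genre"].value_counts()
all_counts = toy["prime_genre"].value_counts()

# Division aligns on the genre labels, so each genre is divided by its own total.
free_share = (free_counts / all_counts).sort_values(ascending=False)
print(free_share)
```

On the real data the equivalent one-liner would be `(free_apps['prime_genre'].value_counts() / data['prime_genre'].value_counts()).sort_values(ascending=False)`. A genre with no free apps at all would come out as `NaN`, because it is missing from the numerator.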


#### 15. If a developer tries to make money by developing and selling Apple Store apps, in which genre should s/he develop the apps? Please assume all apps cost the same amount of time and expense to develop.

We will leave this question to you. There are several ways to solve it. Ideally, your output should look like the following:

```
    average_price              genre
21       8.776087            Medical
11       5.116316           Business
4        4.836875          Reference
6        4.835435              Music
1        4.330562       Productivity
15       4.124783         Navigation
16       4.028234          Education
12       1.916444   Health & Fitness
20       1.790536               Book
7        1.647621          Utilities
2        1.605417            Weather
18       1.552381       Food & Drink
14       1.473295      Photo & Video
0        1.432923              Games
8        1.120370             Travel
10       0.953070             Sports
13       0.889701      Entertainment
17       0.885417          Lifestyle
22       0.799000           Catalogs
19       0.517733               News
5        0.421154            Finance
9        0.339880  Social Networking
3        0.016311           Shopping
```

In [115]:

average_price_by_genre = data.groupby('prime_genre')['price'].mean()
result_df = pd.DataFrame({'average_price': average_price_by_genre})
result_df = result_df.sort_values(by='average_price', ascending=False)

print(result_df[['average_price']])



                   average_price
prime_genre                     
Medical                 8.776087
Business                5.116316
Reference               4.836875
Music                   4.835435
Productivity            4.330562
Navigation              4.124783
Education               4.028234
Health & Fitness        1.916444
Book                    1.790536
Utilities               1.647621
Weather                 1.605417
Food & Drink            1.552381
Photo & Video           1.473295
Games                   1.430317
Travel                  1.120370
Sports                  0.953070
Entertainment           0.889701
Lifestyle               0.885417
Catalogs                0.799000
News                    0.517733
Finance                 0.421154
Social Networking       0.339880
Shopping                0.016311


# Challenge - Applying Functions to DataFrames

#### Our next step is to apply a function to a dataframe and transform all of its cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [125]:
# Run this code:

# The dataset below contains information about pollution from PM2.5 particles in Beijing 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the head() function.

In [126]:
pm25.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [128]:
pm25

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
43819,43820,2014,12,31,19,8.0,-23,-2.0,1034.0,NW,231.97,0,0
43820,43821,2014,12,31,20,10.0,-22,-3.0,1034.0,NW,237.78,0,0
43821,43822,2014,12,31,21,10.0,-22,-3.0,1034.0,NW,242.70,0,0
43822,43823,2014,12,31,22,8.0,-22,-4.0,1034.0,NW,246.72,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [None]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    # Your code here:
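
One possible completion, shown as a sketch rather than the reference solution. The function works on a single number and, because division is element-wise in pandas, on a whole Series as well.

```python
def hourly(x):
    """Return the value divided by 24."""
    return x / 24


print(hourly(48))
```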
    

Apply this function to the columns Iws, Is, and Ir. Store this new dataframe in the variable pm25_hourly.

In [None]:
# Your code here:
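
A minimal sketch of the column-wise application, using a made-up two-row frame in place of `pm25` so that it runs without a download. It assumes `pm25_hourly` is meant to hold only the three transformed columns; the exercise text could also be read as asking for the full dataframe.

```python
import pandas as pd


def hourly(x):
    """Return the value divided by 24."""
    return x / 24


# Hypothetical values; only the column names come from the pm25 dataset.
toy = pd.DataFrame({"Iws": [24.0, 48.0], "Is": [0, 24], "Ir": [12, 0]})

# DataFrame.apply calls hourly once per column (a Series); the division is element-wise.
toy_hourly = toy[["Iws", "Is", "Ir"]].apply(hourly)
print(toy_hourly)
```

On the real data the corresponding line would be `pm25_hourly = pm25[['Iws', 'Is', 'Ir']].apply(hourly)`.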

#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by (the length of the column minus 1). Since we are using pandas, do not use the `len()` function; one alternative is `count()`. Also, use the NumPy version of the standard deviation.

In [None]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by (the number of elements in the series minus 1)
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # Your code here:
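
A sketch of one way to write it, assuming the intended formula is `np.std(x) / (x.count() - 1)`. That reading reproduces the docstring example: the population standard deviation of 1, 2, 3, 4 is about 1.118, and dividing by 3 gives about 0.3727. Dropping missing values before calling `np.std` is an added assumption, made so that the numerator and `count()` look at the same elements.

```python
import numpy as np
import pandas as pd


def sample_sd(x):
    """NumPy standard deviation (ddof=0 by default) divided by (count - 1)."""
    values = x.dropna().to_numpy()   # count() also ignores missing values
    return np.std(values) / (x.count() - 1)


result = sample_sd(pd.Series([1, 2, 3, 4]))
print(round(result, 10))
```

It can then be applied to selected columns with, for example, `pm25[['Iws', 'Is', 'Ir']].apply(sample_sd)`, which returns one value per column.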