---   

<h1 align="center">Introduction to Data Analysis and Data Science for Beginners</h1>
<h1 align="center">Lecture no 2.18 (Pandas-09)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _Aggregating and Grouping Dataframes.ipynb_

### Our Main Problem :
Given the dataset below, find the minimum temperature of each city.

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
df

<h3><div align="right">Grouping Dataframes</div></h3>  
<img align="left" width="1000" height="1000"  src="images/groupbyfinal.png"  >

In [None]:
# Aggregation means to collect

## Learning agenda of this notebook
1. Overview of Aggregation Functions and the `agg()` method
    - Applying a Built-in Aggregation Function on Entire Dataframe Object
    - Applying a Built-in Aggregation Function on a Series Object
    - Applying a User-Defined/Lambda Function on a Series Object<br><br>
2. Computing the Minimum Temperature of each City using **hard way**<br><br>
3. Computing the Minimum Temperature of each City using **`groupby`**<br><br>
4. Practice GroupBy on Stack Overflow Survey Dataset

## 1. Overview of Aggregation Functions and the `agg()` Method
- An aggregation function takes multiple individual values and returns a single summary value, such as their minimum, maximum, count, or mean.
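
As a minimal sketch of this idea, the snippet below reduces a small series to single values. The readings are made up for illustration and are not taken from the lecture dataset.

```python
import pandas as pd

# Hypothetical temperature readings, made up for illustration
temps = pd.Series([32, 36, 28, 41])

# Each aggregation collapses the four values into one scalar
lowest = temps.min()
highest = temps.max()
average = temps.mean()
print(lowest, highest, average)
```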

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
df

### a. Applying a Built-in Aggregation Function on Entire Dataframe Object

In [None]:
df.min()

In [None]:
df.count()

In [None]:
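# median() is defined for numeric columns only. On a dataframe with non-numeric columns,
# older pandas versions warn, and pandas 2.0+ raises a TypeError unless numeric_only=True.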
df.median()

In [None]:
df.median(numeric_only=True)

> We can call the `agg()` method on the dataframe to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings.

In [None]:
df.agg(['min', 'max',  'count'])
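
The shape of the result may be easier to see on a tiny frame. This is a sketch on made-up data; only the column names `city` and `temperature` are borrowed from this notebook.

```python
import pandas as pd

# Hypothetical data; the values are invented for illustration
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 17],
})

summary = demo.agg(['min', 'max', 'count'])
print(summary)

# The row labels are the function names; the columns are the original columns
print(list(summary.index))
```

Note that `min` and `max` on the text column compare strings alphabetically.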

> We can call the `describe()` method on the dataframe to get descriptive statistical measures on all its numeric columns.

In [None]:
df.describe()

### b. Applying a Built-in Aggregation Function on a Series Object

In [None]:
df['temperature'].min()

In [None]:
df['temperature'].max()

In [None]:
df['temperature'].mean()

> We can call the `agg()` method on a series to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings.

In [None]:
df['temperature'].agg(['min', 'max', 'mean', 'count'])

> We can call the `describe()` method on a series to get descriptive statistical measures for that single column.

In [None]:
df['temperature'].describe()

### c. Applying a User-Defined/Lambda Function on a Series Object using the `apply()` Method
- We have used the `apply()` method before. It invokes a function on each value of a Series and returns the results as a new Series.

In [None]:
df.temperature

In [None]:
def ctof(x):
    return x*9/5+32

df.temperature.apply(ctof)

In [None]:
df.temperature.apply(lambda x: x*9/5+32)
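
For simple arithmetic like this conversion, the same result can usually be obtained without `apply()` by operating on the whole series at once. The sketch below uses made-up Celsius values to compare the two approaches.

```python
import pandas as pd

# Hypothetical Celsius readings, made up for illustration
celsius = pd.Series([0.0, 25.0, 100.0], name='temperature')

# Element-by-element conversion with apply()
via_apply = celsius.apply(lambda x: x * 9 / 5 + 32)

# The same conversion expressed as arithmetic on the whole series
via_vectorized = celsius * 9 / 5 + 32

print(via_vectorized.tolist())
```

`apply()` remains useful when the logic cannot be written as plain series arithmetic.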

# How to Compute the Minimum Temperature of Each City?

## 2. Doing it the Hard Way
<img align="center" width="700" height="500"  src="images/groupbyfinal.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

### a. Splitting the Dataframe
- We use conditional selection: we build a Boolean mask on the `city` column and pass it to the dataframe to keep only the matching rows. There are two ways to do this:
    - Using the `df[]` subscript operator
    - Using the `df.loc` indexer

In [None]:
df[df['city']=='karachi']

In [None]:
df[df['city']=='lahore']
df.loc[df.city=='lahore', :]

In [None]:
df[df['city']=='karachi']
df.loc[df.city=='karachi', :]

In [None]:
df[df['city']=='murree']
df.loc[df.city=='murree', :]

>**Limitation:**
>- We have to repeat this process for every city separately.
>- What if there are over 100 cities in the dataset?

### b. Applying the `min()` Function
- We need to apply the `min()` function on the temperature column of all of the above dataframes separately

In [None]:
df.loc[df.city=='lahore', :].temperature.min()

In [None]:
df.loc[df.city=='karachi', :].temperature.min()

In [None]:
df.loc[df.city=='murree', :].temperature.min()

>**Limitation:**
>- We have to repeat this process for every city separately.
>- What if there are over 100 cities in the dataset?

### c. Combining the Result
- Now that we have the minimum temperature of every city, we combine the values into a single series object that can be used for later processing.

In [None]:
lhr = df.loc[df.city=='lahore', :].temperature.min()
kci = df.loc[df.city=='karachi', :].temperature.min()
murree = df.loc[df.city=='murree', :].temperature.min()

s = pd.Series(data=[lhr, kci, murree], index=['L_min', 'K_min', 'M_min'])
s.name = 'Min Temperatures'
s
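
The repetition noted in the limitations above can be reduced with a loop over the distinct city names. This is only a sketch of the manual approach on made-up data (the `demo` frame, `minimums` dictionary, and `s_demo` series are hypothetical names); it still performs the split, apply, and combine steps by hand.

```python
import pandas as pd

# Hypothetical data; only the column names come from this notebook
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 17, 41, 31, 14],
})

minimums = {}
for city in demo['city'].unique():
    mask = demo['city'] == city                                 # split
    minimums[city] = demo.loc[mask, 'temperature'].min()        # apply

s_demo = pd.Series(minimums, name='Min Temperatures')           # combine
print(s_demo)
```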

# How to Compute the Minimum Temperature of Each City?

## 3. An Elegant Way
<img align="center" width="700" height="500"  src="images/groupbyfinal.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

### a. Step 1: Split Step
- In the split step we divide the rows of the dataframe into multiple groups.
- Since we need to calculate the minimum temperature of each city, we call the `groupby()` method on the `city` column of the dataframe.
- This returns a `DataFrameGroupBy` object. Iterating over it yields one `(group key, dataframe)` pair for each distinct value of the `by` argument passed to `groupby()`.

In [None]:
dfgb = df.groupby('city')
dfgb

>- Since this object is iterable, let us iterate over it :)

In [None]:
# Each item is a (group key, dataframe) tuple
for city, mydf in dfgb:
    print(city)
    print(mydf)

>- To display the row indices of every group, use the `groups` attribute of the `DataFrameGroupBy` object.
>- It returns a dictionary-like object (PrettyDict) whose keys are the group values and whose values are the row index labels belonging to each group.

In [None]:
dfgb.groups   # df.groupby('city').groups

>- To display the records of a specific group, use the `get_group()` method of the `DataFrameGroupBy` object.
>- It constructs and returns a DataFrame containing the rows of the group with the given name.

In [None]:
# Display the DataFrame of a specific group by passing the group value
dfgb.get_group('murree')  # df.groupby('city').get_group('murree')

>- To find the size of each group, use the `size()` method of the `DataFrameGroupBy` object.
>- It returns a Series containing the number of rows in each group.

In [None]:
dfgb.size()  #df.groupby('city').size()
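
The split step can be tried on a small inline frame when the CSV file is not at hand. The sketch below uses made-up rows; `demo` and `grouped` are hypothetical names, while `groupby()`, `get_group()` and `size()` are the same calls used above.

```python
import pandas as pd

# Hypothetical data; only the column names come from this notebook
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi'],
    'temperature': [38, 33, 17, 41, 31],
})
grouped = demo.groupby('city')

# Iterating yields (group key, sub-dataframe) pairs, sorted by key by default
keys = [key for key, group in grouped]

# get_group() returns only the rows that belong to one key
lahore_rows = grouped.get_group('lahore')

# size() counts the rows in each group
sizes = grouped.size()

print(keys)
print(lahore_rows)
print(sizes)
```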

> Now that we understand the `groupby()` method, let us move to step 2: applying a function.

### b. Step 2: Apply Step
- In the second step we apply an appropriate aggregate function to every group inside the `DataFrameGroupBy` object.

**Let us first apply an aggregate function to a specific column of one group at a time. (Selecting a column directly from a `DataFrameGroupBy` object gives a `SeriesGroupBy` object, which we use later.)**

In [None]:
df

In [None]:
df.groupby('city')

In [None]:
df.groupby('city').get_group('lahore')

In [None]:
df.groupby('city').get_group('lahore').temperature.min()

In [None]:
df.groupby('city').get_group('karachi').temperature.min()

In [None]:
df.groupby('city').get_group('murree').temperature.min()

### c. Step 3: Combine Step
- Now that we have the minimum temperature of all three cities, let us combine the results into a series object.

In [None]:
kci = df.groupby('city').get_group('karachi').temperature.min()
lhr = df.groupby('city').get_group('lahore').temperature.min()
murree = df.groupby('city').get_group('murree').temperature.min()

s1 = pd.Series(data=[kci, lhr, murree], index=['K_min', 'L_min', 'M_min'])
s1.name = 'Min Temperatures'
s1

>- **Let us perform the `apply + combine` steps in one go, by applying the `min()` function on the temperature series of all the dataframes inside the DataFrameGroupBy object.**
>- **This saves us from the hassle of applying `min()` method explicitly as done above**

In [None]:
df.groupby('city')

In [None]:
df.groupby('city').temperature

In [None]:
df.groupby('city').temperature.min()

>- **We can also apply `agg()` method on the temperature series of all the dataframes inside the DataFrameGroupBy object**

In [None]:
df.groupby('city').temperature.agg(['min', 'max', 'sum', 'mean'])

>- Note that we get a dataframe this time.
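
The difference between the two result types can be checked on a small made-up frame. In this sketch `demo`, `min_per_city` and `stats_per_city` are hypothetical names; a single aggregation on a `SeriesGroupBy` gives a Series, while `agg()` with a list gives a DataFrame with one column per function.

```python
import pandas as pd

# Hypothetical data; only the column names come from this notebook
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 17, 41, 31, 14],
})

# One aggregation function: the result is a Series indexed by city
min_per_city = demo.groupby('city')['temperature'].min()

# A list of functions: the result is a DataFrame, one column per function
stats_per_city = demo.groupby('city')['temperature'].agg(['min', 'max'])

print(type(min_per_city).__name__)
print(type(stats_per_city).__name__)
print(stats_per_city)
```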

## 4. Practice GroupBy on Stack Overflow Survey Dataset
Download the data from: https://insights.stackoverflow.com/survey/

### a. Understand the Data Set

In [None]:
import pandas as pd
df = pd.read_csv('datasets/so_survey_subset.csv', index_col='Respondent')
df.shape

In [None]:
df.head()

In [None]:
df.loc[df['Country']=='Pakistan', :]

In [None]:
import pandas as pd
schema = pd.read_csv('datasets/so_survey_subset_schema.csv', index_col='Column')
schema

In [None]:
schema.loc['Hobbyist']

In [None]:
df['Hobbyist']

In [None]:
schema.loc['Country']

In [None]:
df['Country']

In [None]:
schema.loc['ConvertedComp']

In [None]:
df['ConvertedComp']

In [None]:
schema.loc['LanguageWorkedWith']

In [None]:
!cat datasets/so_survey_subset_schema.csv

In [None]:
df['LanguageWorkedWith']

In [None]:
schema.loc['SocialMedia']

In [None]:
df['SocialMedia']

In [None]:
df

##### Let us perform some basic statistical analysis on the Dataset

In [None]:
# Returns the count of non-NA values for a series object.
df['Hobbyist'].count()

In [None]:
# Returns a Series containing the count of each unique value.
df['Hobbyist'].value_counts()

In [None]:
# Returns the count of non-NA values for a series object.
df['Country'].count()

In [None]:
# Returns a Series containing the count of each unique value.
df['Country'].value_counts()

In [None]:
# To get the count of countries whose developers participated in the survey
df['Country'].value_counts().count()

In [None]:
# Returns the count of non-NA values for a series object.
df['ConvertedComp'].count()

In [None]:
# Returns a Series containing the count of each unique value.
df['ConvertedComp'].value_counts()

In [None]:
df['ConvertedComp'].mean()

In [None]:
df['ConvertedComp'].median()

In [None]:
df.describe()

<h1 align="center">Let us try answering certain Questions</h1>

##  Question 1: 
>**List the most popular SocialMedia web site for every Country**

**Let us first do the easier task: list the most popular SocialMedia website of a single country (let's say Pakistan)**

In [None]:
df

In [None]:
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()

In [None]:
df.loc[df.Country =='Pakistan', :]
df.loc[df.Country =='Pakistan', 'SocialMedia'].head(10)
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts(normalize=True)
df.loc[df.Country =='China', 'SocialMedia'].value_counts()

In [None]:
df.groupby('Country')

In [None]:
df.groupby('Country').get_group("Pakistan")

In [None]:
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia']

In [None]:
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia'].value_counts()

In [None]:
df.groupby('Country')['SocialMedia'].value_counts().head(60)

In [None]:
df.groupby('Country')['SocialMedia'].value_counts().head(50)
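
The grouped `value_counts()` output lists every platform for every country. One possible way to keep only the top platform per country is sketched below on made-up rows; the `survey` frame and its values are hypothetical, and combining `value_counts()` with `idxmax()` is an illustrative choice rather than the only option.

```python
import pandas as pd

# Hypothetical survey rows; the real file has many more rows and columns
survey = pd.DataFrame({
    'Country': ['Pakistan', 'Pakistan', 'Pakistan', 'China', 'China'],
    'SocialMedia': ['WhatsApp', 'WhatsApp', 'Facebook', 'WeChat', 'WeChat'],
})

# For each country, count the platforms and keep the label with the highest count
top_platform = survey.groupby('Country')['SocialMedia'].agg(
    lambda s: s.value_counts().idxmax()
)
print(top_platform)
```

If two platforms tie, `idxmax()` returns the first one it encounters. A country whose `SocialMedia` answers are all missing would raise an error here, so such rows may need to be dropped first.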

##  Question 2: 
>**What percentage of people in each country know Python programming?**

**tc** = Total count: the number of people from each country who participated in the survey

**pp** = Python people: the number of people from each country who know Python

**tc (option 1):**

In [None]:
df

In [None]:
df.loc[:, 'Country']

In [None]:
tc = df['Country'].value_counts()
tc.name = 'Total'
tc

**tc (option 2):**

In [None]:
dfgb = df.groupby('Country')
dfgb

In [None]:
df.groupby('Country')['Country']

In [None]:
df.groupby('Country')['Country'].apply(lambda x: x.value_counts())

**pp:**

In [None]:
df.loc[:, 'LanguageWorkedWith']

In [None]:
df.groupby('Country')['LanguageWorkedWith']

In [None]:
df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python'))

In [None]:
pp = df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
pp

In [None]:
pp.name = 'Knows Python'

**Combine the two series tc and pp into a dataframe**

In [None]:
resultdf = pd.concat([tc, pp], axis=1)
resultdf

In [None]:
resultdf.loc['Pakistan']

**What percentage of people in each country know Python?**

In [None]:
resultdf['Percentage'] = (resultdf['Knows Python'] / resultdf['Total']) * 100
resultdf

In [None]:
resultdf.loc['Pakistan']
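
The same percentage can also be sketched in a single pass by averaging a Boolean column within each group. The rows below are made up, `survey`, `knows_python` and `pct_python` are hypothetical names, and `na=False` is an assumption that a missing answer counts as not knowing Python.

```python
import pandas as pd

# Hypothetical survey rows; languages are separated by ';' as in the real column
survey = pd.DataFrame({
    'Country': ['Pakistan', 'Pakistan', 'China', 'China', 'China', 'China'],
    'LanguageWorkedWith': ['Python;SQL', 'Java', 'Python', None,
                           'C++;Python', 'Python;Go'],
})

# na=False treats a missing answer as "does not know Python"
knows_python = survey['LanguageWorkedWith'].str.contains('Python', na=False)

# The mean of a Boolean series is the fraction of True values
pct_python = knows_python.groupby(survey['Country']).mean() * 100
print(pct_python)
```

Respondents with a missing answer still count toward the country total here, which matches dividing `pp` by `tc` above.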

## Check Your Concepts:
- What is Pandas?

## Practice Questions

# Pandas - Assignment no 09
- Here is the link to Pandas - [Assignment no 09]()