---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.18(Pandas-09)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _Aggregating and Grouping Dataframes.ipynb_

### Our Main Problem :
Given the dataset below, find the minimum temperature of each city in it.

In [21]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
# df

<h3><div align="right">Grouping Dataframes</div></h3>  
<img align="left" width="1000" height="1000"  src="images/groupbyfinal.png"  >

In [None]:
# Aggregation means to collect

## Learning agenda of this notebook
1. Overview of Aggregation Functions and the `agg()` method
    - Applying a Built-in Aggregation Function on Entire Dataframe Object
    - Applying a Built-in Aggregation Function on a Series Object
    - Applying a User-Defined/Lambda Function on a Series Object<br><br>
2. Computing the Minimum Temperature of each City using **hard way**<br><br>
3. Computing the Minimum Temperature of each City using **`groupby`**<br><br>
4. Practice GroupBy on Stack Overflow Survey Dataset

## 1. Overview of Aggregation Functions and the `agg()` Method
- An aggregation function takes multiple individual values and returns a single summary value (for example `min`, `max`, `mean`, or `count`).
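As a minimal sketch of this idea, the snippet below aggregates a small series. The values are hypothetical and are not taken from `groupbydata2.csv`.

```python
import pandas as pd

# Hypothetical data, for illustration only
temps = pd.Series([32, 45, 28, 35], name='temperature')

# Each aggregation collapses the four values into a single scalar
lowest = temps.min()      # smallest value
average = temps.mean()    # arithmetic mean
print(lowest, average)
```

Whatever the length of the series, each call returns exactly one value.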

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata2.csv')
df

### a. Applying a Built-in Aggregation Function on Entire Dataframe Object

In [None]:
df.min()

In [None]:
df.count()

In [None]:
# Meaningful for numeric columns only; on non-numeric columns this warns in older pandas and raises a TypeError in pandas 2.0+
df.median()

In [None]:
df.median(numeric_only=True)

> We can call the `agg()` method on the dataframe to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings.

In [None]:
df.agg(['min', 'max',  'count'])
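Besides a list, `agg()` also accepts a dictionary that maps column names to the functions wanted for each column. The sketch below assumes a hypothetical two-column dataframe named `demo`; the `humidity` column is an illustrative assumption, not part of the lecture data.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'temperature': [32, 45, 28],
    'humidity': [60, 40, 75],
})

# Map each column name to the aggregation(s) wanted for that column
summary = demo.agg({'temperature': ['min', 'max'], 'humidity': ['mean']})
print(summary)
```

Cells for a function that was not requested on a column are filled with `NaN`.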

> We can call the `describe()` method on the dataframe to get descriptive statistical measures on all its numeric columns.

In [None]:
df.describe()

### b. Applying a Built-in Aggregation Function on a Series Object

In [None]:
df['temperature'].min()

In [None]:
df['temperature'].max()

In [None]:
df['temperature'].mean()

> We can call the `agg()` method on a series to apply multiple aggregation functions at a time, by passing the `agg()` function a list of aggregation functions as strings.

In [None]:
df['temperature'].agg(['min', 'max', 'mean', 'count'])

> We can call the `describe()` method on a series to get descriptive statistical measures of its values.

In [None]:
df['temperature'].describe()

### c. Applying a User-Defined/Lambda Function on a Series Object using the `apply()` Method
- We have used the `apply()` method before: it invokes a function on each value of a Series and returns the resulting Series.

In [None]:
df.temperature

In [None]:
def ctof(x):
    return x*9/5+32

df.temperature.apply(ctof)

In [None]:
df.temperature.apply(lambda x: x*9/5+32)
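For simple arithmetic such as this conversion, the same result can usually be obtained without `apply()`, because pandas applies arithmetic operators element-wise. A small sketch on hypothetical Celsius values:

```python
import pandas as pd

# Hypothetical Celsius readings, for illustration only
celsius = pd.Series([0, 25, 100], name='temperature')

# Element-wise arithmetic on the whole Series, no apply() needed
fahrenheit = celsius * 9 / 5 + 32

# Same conversion through apply(), for comparison
via_apply = celsius.apply(lambda x: x * 9 / 5 + 32)
print(fahrenheit.tolist())
```

`apply()` remains useful when the function cannot be written as plain arithmetic on the series.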

# How to Compute the Minimum Temperature of Each City?

## 2. Doing it the Hard Way
<img align="center" width="700" height="500"  src="images/groupbyfinal.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

### a. Splitting the Dataframe
- We need to use conditional selection, in which we pass a Boolean mask built from the `city` column to select the matching rows. We can do it in two ways:
    - Using the `df[]` subscript operator
    - Using the `df.loc` indexer

In [None]:
df[df['city']=='karachi']

In [None]:
df[df['city']=='lahore']
df.loc[df.city=='lahore', :]

In [None]:
df[df['city']=='karachi']
df.loc[df.city=='karachi', :]

In [None]:
df[df['city']=='murree']
df.loc[df.city=='murree', :]

>**Limitation:**
>- We have to repeat this process for every city separately.
>- What if there are over 100 cities in the dataset?

### b. Applying the `min()` Function
- We need to apply the `min()` function on the temperature column of all of the above dataframes separately

In [None]:
df.loc[df.city=='lahore', :].temperature.min()

In [None]:
df.loc[df.city=='karachi', :].temperature.min()

In [None]:
df.loc[df.city=='murree', :].temperature.min()

>**Limitation:**
>- We have to repeat this process for every city separately.
>- What if there are over 100 cities in the dataset?

### c. Combining the Result
- Now that we have the minimum temperature of every city, we need to combine the values into a single series object for later processing.

In [None]:
lhr = df.loc[df.city=='lahore', :].temperature.min()
kci = df.loc[df.city=='karachi', :].temperature.min()
murree = df.loc[df.city=='murree', :].temperature.min()

s = pd.Series(data=[lhr, kci, murree], index=['L_min', 'K_min', 'M_min'] )
s.name= 'Min Temperatures'
s
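The repetition in the hard way could be reduced with a loop over the distinct city names. The sketch below uses a hypothetical dataframe `demo` with made-up values, so the numbers do not correspond to `groupbydata1.csv`.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 18, 41, 30, 15],
})

# Split, apply and combine by hand, once per distinct city
mins = {}
for city in demo['city'].unique():
    mins[city] = demo.loc[demo['city'] == city, 'temperature'].min()

result = pd.Series(mins, name='Min Temperatures')
print(result)
```

This scales to any number of cities, but it is still manual bookkeeping, which is what `groupby()` takes care of.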

# How to Compute the Minimum Temperature of Each City?

## 3. An Elegant Way
<img align="center" width="700" height="500"  src="images/groupbyfinal.png"  >

In [None]:
import pandas as pd
df = pd.read_csv('datasets/groupbydata1.csv')
df

### a. Step 1: Split Step
- In the split step, we divide the data inside the dataframe into multiple groups.
- Since we need to calculate the minimum temperature of each city, we will use the `groupby()` method on the `city` column of the dataframe.
- This returns a `DataFrameGroupBy` object. Iterating over it yields one `(group name, sub-dataframe)` pair per distinct value of the `by` argument passed to `groupby()`.

In [None]:
dfgb = df.groupby('city')
dfgb

>- Since this is an iterable, let us iterate over it :)

In [None]:
# Each item is a (group name, sub-dataframe) pair
for city, mydf in dfgb:
    print(city)
    print(mydf)

>- To display the row indices of every group, use the `groups` attribute of the `DataFrameGroupBy` object.
>- It returns a dictionary-like object (`PrettyDict`) whose keys are the group values and whose values are the row labels belonging to each group.

In [None]:
dfgb.groups   # df.groupby('city').groups

>- To display the records of a specific group, use the `get_group()` method of the `DataFrameGroupBy` object.
>- It constructs and returns a DataFrame containing the rows of the group with the given name.

In [None]:
# Display the DataFrame of a specific group by providing the group value
dfgb.get_group('murree') # df.groupby('city').get_group('murree')

>- To find the size of each group, use the `size()` method of the `DataFrameGroupBy` object.
>- It returns a Series containing the number of rows in each group.

In [None]:
dfgb.size()  #df.groupby('city').size()
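The three inspection tools above can be tried together on a tiny example. This is a sketch on a hypothetical dataframe `demo`, not on the lecture dataset.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'lahore', 'murree'],
    'temperature': [38, 33, 41, 18],
})
grouped = demo.groupby('city')

# Row labels that belong to the 'lahore' group
lahore_rows = list(grouped.groups['lahore'])

# The rows of that group as an ordinary DataFrame
lahore_df = grouped.get_group('lahore')

# Number of rows in every group, returned as a Series indexed by city
sizes = grouped.size()
print(lahore_rows, sizes.to_dict())
```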

> After understanding the `groupby()` method, let us move to step 2: applying a function.

### b. Step 2: Apply Step
- In the second step, we apply an appropriate aggregate function to every group inside the `DataFrameGroupBy` object.

**Let us first apply an aggregate function on a specific column of the `DataFrameGroupBy` object; selecting a column from it gives a `SeriesGroupBy` object**

In [None]:
df

In [None]:
df.groupby('city')

In [None]:
df.groupby('city').get_group('lahore')

In [None]:
df.groupby('city').get_group('lahore').temperature.min()

In [None]:
df.groupby('city').get_group('karachi').temperature.min()

In [None]:
df.groupby('city').get_group('murree').temperature.min()

### c. Step 3: Combine Step
- Now that we have the minimum temperature of all three cities, let us combine the results into a series object.

In [None]:
kci = df.groupby('city').get_group('karachi').temperature.min()
lhr = df.groupby('city').get_group('lahore').temperature.min()
murree = df.groupby('city').get_group('murree').temperature.min()

s1 = pd.Series(data=[kci, lhr, murree], index=['K_min', 'L_min', 'M_min'] )
s1.name= 'Min Temperatures'
s1

>- **Let us perform the `apply + combine` steps in one go, by applying the `min()` function on the temperature series of all the dataframes inside the DataFrameGroupBy object.**
>- **This saves us from the hassle of applying `min()` method explicitly as done above**

In [None]:
df.groupby('city')

In [None]:
df.groupby('city').temperature

In [None]:
df.groupby('city').temperature.min()
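To see that the one-liner gives the same answer as the manual approach, here is a sketch on a hypothetical dataframe `demo` (made-up values, not the lecture file).

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 18, 41, 30, 15],
})

# One call performs split (groupby), apply (min) and combine (Series result)
min_by_city = demo.groupby('city')['temperature'].min()

# The same number obtained the hard way for one city
lahore_manual = demo.loc[demo['city'] == 'lahore', 'temperature'].min()
print(min_by_city)
```

The result is a Series indexed by city, sorted by the group key.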

>- **We can also apply `agg()` method on the temperature series of all the dataframes inside the DataFrameGroupBy object**

In [None]:
df.groupby('city').temperature.agg(['min', 'max', 'sum', 'mean'])

>- Note that we get a dataframe this time, with one column per aggregation function.
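When the result columns should carry more descriptive names, pandas also supports named aggregation, where each keyword argument is an output column name paired with a `(column, function)` tuple. A sketch on hypothetical data; the names `min_temp` and `max_temp` are illustrative choices.

```python
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({
    'city': ['lahore', 'karachi', 'murree', 'lahore', 'karachi', 'murree'],
    'temperature': [38, 33, 18, 41, 30, 15],
})

# keyword = (source column, aggregation function)
summary = demo.groupby('city').agg(
    min_temp=('temperature', 'min'),
    max_temp=('temperature', 'max'),
)
print(summary)
```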

## 4. Practice GroupBy on Stack Overflow Survey Dataset
The data can be downloaded from: https://insights.stackoverflow.com/survey/

### a. Understand the Data Set

In [1]:
import pandas as pd
df = pd.read_csv('datasets/so_survey_subset.csv', index_col='Respondent')
df.shape

(88883, 9)

In [2]:
df.head()

| Respondent | MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I am a student who is learning to code | Yes | United Kingdom | 4.0 | NaN | HTML/CSS;Java;JavaScript;Python | Twitter | 14.0 | Man |
| 2 | I am a student who is learning to code | No | Bosnia and Herzegovina | NaN | NaN | C++;HTML/CSS;Python | Instagram | 19.0 | Man |
| 3 | I am not primarily a developer, but I write co... | Yes | Thailand | 3.0 | 8820.0 | HTML/CSS | Reddit | 28.0 | Man |
| 4 | I am a developer by profession | No | United States | 3.0 | 61000.0 | C;C++;C#;Python;SQL | Reddit | 22.0 | Man |
| 5 | I am a developer by profession | Yes | Ukraine | 16.0 | NaN | C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA | Facebook | 30.0 | Man |


In [3]:
df.loc[df['Country']=='Pakistan', :]

| Respondent | MainBranch | Hobbyist | Country | YearsCode | ConvertedComp | LanguageWorkedWith | SocialMedia | Age | Gender |
|---|---|---|---|---|---|---|---|---|---|
| 84 | I am a developer by profession | No | Pakistan | 3 | 3468.0 | C;C++;C#;Java;Kotlin;PHP;SQL | WhatsApp | 26.0 | Man |
| 119 | I am a developer by profession | No | Pakistan | 10 | NaN | C;C++;C#;HTML/CSS;Java;JavaScript;SQL | Facebook | 28.0 | Man |
| 298 | I am a developer by profession | Yes | Pakistan | 4 | NaN | HTML/CSS;JavaScript;PHP;SQL;Other(s): | LinkedIn | 23.0 | Man |
| 299 | I am a developer by profession | Yes | Pakistan | 19 | NaN | Assembly;C;C++;Java;Python;SQL | Facebook | 25.0 | Man |
| 311 | I am a developer by profession | No | Pakistan | 5 | 2600.0 | Assembly;C;C++;C#;HTML/CSS;Java;Python;Scala;SQL | LinkedIn | 24.0 | Man |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88862 | I am a student who is learning to code | Yes | Pakistan | 3 | NaN | Java | WhatsApp | 21.0 | Man |
| 5439 | NaN | Yes | Pakistan | 2 | NaN | NaN | Instagram | 24.0 | Woman |
| 39117 | NaN | Yes | Pakistan | 4 | NaN | C;C++;C#;HTML/CSS;Java;JavaScript;SQL | WhatsApp | 22.0 | Man |
| 60066 | NaN | Yes | Pakistan | 4 | NaN | Assembly;C++;C#;HTML/CSS;Java;PHP;Python;SQL | YouTube | 20.0 | Man |


In [5]:
import pandas as pd
schema = pd.read_csv('datasets/so_survey_subset_schema.csv', index_col='Column')
schema

| Column | QuestionText |
|---|---|
| Respondent | Randomized respondent ID number (not in order ... |
| MainBranch | Which of the following options best describes ... |
| Hobbyist | Do you code as a hobby? |
| Country | In which country do you currently reside? |
| YearsCode | Including any education, how many years have y... |
| ConvertedComp | Salary converted to annual USD salaries using ... |
| LanguageWorkedWith | Which of the following programming, scripting,... |
| SocialMedia | What social media site do you use the most? |
| Age | What is your age (in years)? If you prefer not... |
| Gender | Which of the following do you currently identi... |


In [6]:
schema.loc['Hobbyist']

QuestionText    Do you code as a hobby?
Name: Hobbyist, dtype: object

In [7]:
df['Hobbyist']

Respondent
1        Yes
2         No
3        Yes
4         No
5        Yes
        ... 
88377    Yes
88601     No
88802     No
88816     No
88863    Yes
Name: Hobbyist, Length: 88883, dtype: object

In [None]:
schema.loc['Country']

In [None]:
df['Country']

In [None]:
schema.loc['ConvertedComp']

In [None]:
df['ConvertedComp']

In [None]:
schema.loc['LanguageWorkedWith']

In [None]:
!cat datasets/so_survey_subset_schema.csv

In [None]:
df['LanguageWorkedWith']

In [None]:
schema.loc['SocialMedia']

In [None]:
df['SocialMedia']

In [None]:
df

##### Let us perform some basic statistical analysis on the Dataset

In [8]:
# Returns the count of non-NA values for a series object.
df['Hobbyist'].count()

88883

In [9]:
# Returns a Series containing counts of unique values.
df['Hobbyist'].value_counts()

Yes    71257
No     17626
Name: Hobbyist, dtype: int64

In [10]:
# Returns the count of non-NA values for a series object.
df['Country'].count()

88751

In [11]:
# Returns a Series containing counts of unique values.
df['Country'].value_counts()

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Country, Length: 179, dtype: int64

### To get the count of countries whose developers participated in the survey

In [12]:
df['Country'].value_counts().count()

179
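Counting the entries of `value_counts()` works; `nunique()` is a more direct way to get the number of distinct non-missing values. A sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical country answers, including one missing value
countries = pd.Series(['Pakistan', 'India', None, 'Pakistan', 'Chad'])

via_value_counts = countries.value_counts().count()
via_nunique = countries.nunique()   # skips missing values by default
print(via_value_counts, via_nunique)
```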

In [13]:
# Returns the count of non-NA values for a series object.
df['ConvertedComp'].count()

55823

In [14]:
# Returns a Series containing counts of unique values.
df['ConvertedComp'].value_counts()

2000000.0    709
1000000.0    558
120000.0     502
100000.0     480
150000.0     434
            ... 
411096.0       1
261228.0       1
82322.0        1
66424.0        1
588012.0       1
Name: ConvertedComp, Length: 9162, dtype: int64

In [15]:
df['ConvertedComp'].mean()

127110.73842323056

In [16]:
df['ConvertedComp'].median()

57287.0

In [None]:
df.describe()

<h1 align="center">Let us try answering certain Questions</h1>

##  Question 1: 
>**List the most popular SocialMedia web site for every Country**

**Let us first do the easier task: list the most popular SocialMedia website of a single country (let's say Pakistan)**

In [None]:
df[df.Country=='Pakistan']

In [None]:
df.loc[df.Country=='Pakistan', 'SocialMedia']

In [None]:
df.loc[df.Country=='Pakistan', 'SocialMedia'].value_counts()

In [17]:
df.columns

Index(['MainBranch', 'Hobbyist', 'Country', 'YearsCode', 'ConvertedComp',
       'LanguageWorkedWith', 'SocialMedia', 'Age', 'Gender'],
      dtype='object')

In [20]:
df.groupby('Country').get_group('Pakistan').loc[:,'SocialMedia'].value_counts()

WhatsApp                    266
Facebook                    232
YouTube                     182
LinkedIn                     71
Twitter                      58
Instagram                    41
Reddit                       28
I don't use social media     23
Snapchat                      5
Hello                         1
VK ВКонта́кте                 1
Name: SocialMedia, dtype: int64

In [None]:
df.loc[df.Country =='Pakistan', :]
df.loc[df.Country =='Pakistan', 'SocialMedia'].head(10)
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts()
df.loc[df.Country =='Pakistan', 'SocialMedia'].value_counts(normalize=True)
df.loc[df.Country =='China', 'SocialMedia'].value_counts()

In [None]:
df.groupby('Country')

In [None]:
df.groupby('Country').get_group("Pakistan").head()

In [None]:
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia']

In [None]:
df.groupby('Country').get_group("Pakistan").loc[:, 'SocialMedia'].value_counts()

In [None]:
df.groupby('Country')['SocialMedia'].value_counts().head(60)

In [None]:
df.groupby('Country')['SocialMedia'].value_counts().head(50)
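The cells above list the counts of every site within each country. To keep only the single most popular site per country, one possible sketch is to take the label with the highest count inside each group. The rows below are hypothetical and stand in for the survey file.

```python
import pandas as pd

# Hypothetical survey rows, for illustration only
survey = pd.DataFrame({
    'Country': ['Pakistan', 'Pakistan', 'Pakistan', 'India', 'India', 'India'],
    'SocialMedia': ['WhatsApp', 'WhatsApp', 'Facebook', 'YouTube', 'Reddit', 'YouTube'],
})

# For every country, count the sites and keep the label with the highest count
top_site = survey.groupby('Country')['SocialMedia'].agg(
    lambda s: s.value_counts().idxmax()
)
print(top_site)
```

On the real data, a country whose `SocialMedia` answers are all missing would give an empty count, so such groups may need to be filtered out before calling `idxmax()`.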

##  Question 2: 
>**What percentage of people in each country know Python programming?**

**tc** = Total count of people from each country who participated in the survey

**pc** = Python people: count of people from each country who know Python (stored in the variable `pp` in the code below)

**tc (option 1):**

In [None]:
df

In [21]:
df.loc[:, 'Country']

Respondent
1                United Kingdom
2        Bosnia and Herzegovina
3                      Thailand
4                 United States
5                       Ukraine
                  ...          
88377                    Canada
88601                       NaN
88802                       NaN
88816                       NaN
88863                     Spain
Name: Country, Length: 88883, dtype: object

In [22]:
tc = df['Country'].value_counts()
tc.name = 'Total'
tc

United States        20949
India                 9061
Germany               5866
United Kingdom        5737
Canada                3395
                     ...  
Tonga                    1
Timor-Leste              1
North Korea              1
Brunei Darussalam        1
Chad                     1
Name: Total, Length: 179, dtype: int64

**tc (option 2):**

In [23]:
dfgb = df.groupby('Country')
dfgb

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f07748dc880>

In [24]:
df.groupby('Country')['Country']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f07748dc9d0>

In [25]:
df.groupby('Country')['Country'].apply(lambda x: x.value_counts()).sort_values()

Country                                                           
Niger                             Niger                                   1
Dominica                          Dominica                                1
Papua New Guinea                  Papua New Guinea                        1
Saint Kitts and Nevis             Saint Kitts and Nevis                   1
Saint Vincent and the Grenadines  Saint Vincent and the Grenadines        1
                                                                      ...  
Canada                            Canada                               3395
United Kingdom                    United Kingdom                       5737
Germany                           Germany                              5866
India                             India                                9061
United States                     United States                       20949
Name: Country, Length: 179, dtype: int64
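The per-country totals can also be obtained with `size()`, which avoids the repeated country label in the index shown above. A sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical survey rows, including one missing country
survey = pd.DataFrame({
    'Country': ['Pakistan', 'India', 'Pakistan', None, 'India', 'Pakistan'],
})

# Rows per country; groupby() drops missing keys by default
per_country = survey.groupby('Country').size().sort_values()

# Same totals from value_counts(), for comparison
per_country_vc = survey['Country'].value_counts()
print(per_country)
```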

**pc:**

In [26]:
df.loc[:, 'LanguageWorkedWith']

Respondent
1                          HTML/CSS;Java;JavaScript;Python
2                                      C++;HTML/CSS;Python
3                                                 HTML/CSS
4                                      C;C++;C#;Python;SQL
5              C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
                               ...                        
88377                        HTML/CSS;JavaScript;Other(s):
88601                                                  NaN
88802                                                  NaN
88816                                                  NaN
88863    Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Name: LanguageWorkedWith, Length: 88883, dtype: object

In [27]:
df.groupby('Country')['LanguageWorkedWith']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f07747582b0>

In [37]:
df.groupby('Country')['LanguageWorkedWith'].apply(lambda x:x.str.contains('Python').sum())

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64

In [28]:
df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python'))

Respondent
1         True
2         True
3        False
4         True
5         True
         ...  
88182    False
88212     True
88282    False
88377    False
88863    False
Name: LanguageWorkedWith, Length: 88751, dtype: object

In [29]:
pp = df.groupby('Country')['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
pp

Country
Afghanistan                              8
Albania                                 23
Algeria                                 40
Andorra                                  0
Angola                                   2
                                        ..
Venezuela, Bolivarian Republic of...    28
Viet Nam                                78
Yemen                                    3
Zambia                                   4
Zimbabwe                                14
Name: LanguageWorkedWith, Length: 179, dtype: int64

In [30]:
pp.name = 'Knows Python'

**Create a Dataframe of two series tc and pp**

In [31]:
resultdf = pd.concat([tc, pp], axis=1)
resultdf

|  | Total | Knows Python |
|---|---|---|
| United States | 20949 | 10083 |
| India | 9061 | 3105 |
| Germany | 5866 | 2451 |
| United Kingdom | 5737 | 2384 |
| Canada | 3395 | 1558 |
| ... | ... | ... |
| Tonga | 1 | 0 |
| Timor-Leste | 1 | 1 |
| North Korea | 1 | 0 |
| Brunei Darussalam | 1 | 0 |


In [32]:
resultdf.loc['Pakistan']

Total           923
Knows Python    251
Name: Pakistan, dtype: int64

In [33]:
resultdf.loc['India']

Total           9061
Knows Python    3105
Name: India, dtype: int64

**What percentage of people in each country know Python?**

In [38]:
resultdf['Percentage'] = (resultdf['Knows Python'] / resultdf['Total']) * 100
resultdf

|  | Total | Knows Python | Percentage |
|---|---|---|---|
| United States | 20949 | 10083 | 48.131176 |
| India | 9061 | 3105 | 34.267741 |
| Germany | 5866 | 2451 | 41.783157 |
| United Kingdom | 5737 | 2384 | 41.554820 |
| Canada | 3395 | 1558 | 45.891016 |
| ... | ... | ... | ... |
| Tonga | 1 | 0 | 0.000000 |
| Timor-Leste | 1 | 1 | 100.000000 |
| North Korea | 1 | 0 | 0.000000 |
| Brunei Darussalam | 1 | 0 | 0.000000 |


In [39]:
resultdf.loc['Pakistan']

Total           923.000000
Knows Python    251.000000
Percentage       27.193933
Name: Pakistan, dtype: float64
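The percentage can also be computed in one pass, since the mean of a Boolean column equals the share of `True` values. The sketch below uses hypothetical rows; it treats a missing language answer as "does not know Python", which matches dividing the Python count by the total number of respondents per country.

```python
import pandas as pd

# Hypothetical survey rows, for illustration only
survey = pd.DataFrame({
    'Country': ['Pakistan', 'Pakistan', 'Pakistan', 'Pakistan', 'India', 'India'],
    'LanguageWorkedWith': ['C;Python', 'Java', 'SQL', None, 'Python', 'C++'],
})

# True where the answer mentions Python; missing answers count as False
knows_python = survey['LanguageWorkedWith'].str.contains('Python', na=False)

# The mean of a Boolean column is the share of True values
percentage = knows_python.groupby(survey['Country']).mean() * 100
print(percentage)
```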

In [None]:
resultdf.sample(20).sort_values(by ='Percentage', ascending=False)

## Now create five questions of your own and try to answer them.

In [2]:
# Q-01 ==???
# Q-02 ==???
# Q-03 ==???
# Q-04 ==???
# Q-05 ==???

## Check Your Concepts:
- What is Pandas?

## Practice Questions

### Write a Pandas program to split the following dataframe into groups based on school code. Also check the type of GroupBy object.

In [6]:
dict1={
    'school_code': ['s001','s002','s003','s001','s002','s004'],
    'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],
    'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill', 'David Parkes'],
    'date_Of_Birth': ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [12, 12, 13, 13, 14, 12],
    'height': [173, 192, 186, 167, 151, 159],
    'weight': [35, 32, 33, 30, 31, 32],
    'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4']}
df = pd.DataFrame(dict1)
df

|  | school_code | class | name | date_Of_Birth | age | height | weight | address |
|---|---|---|---|---|---|---|---|---|
| 0 | s001 | V | Alberto Franco | 15/05/2002 | 12 | 173 | 35 | street1 |
| 1 | s002 | V | Gino Mcneill | 17/05/2002 | 12 | 192 | 32 | street2 |
| 2 | s003 | VI | Ryan Parkes | 16/02/1999 | 13 | 186 | 33 | street3 |
| 3 | s001 | VI | Eesha Hinton | 25/09/1998 | 13 | 167 | 30 | street1 |
| 4 | s002 | V | Gino Mcneill | 11/05/2002 | 14 | 151 | 31 | street2 |
| 5 | s004 | VI | David Parkes | 15/09/1997 | 12 | 159 | 32 | street4 |


### Write a Pandas program to split the above dataframe by school code and get the mean, min, and max value of age for each school.

### Write a Pandas program to split the above given dataframe into groups based on school code and class.

### Write a Pandas program to split the above given dataframe into groups based on school code and cast the grouping as a list. (Hint: `list(groupbyObject)`)

### Write a Pandas program to split the above given dataframe into groups based on school code and retrieve a specific group by its name.

## Practice Exercise 01

#### Step 1. Import the necessary libraries

#### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/drinks.csv). 

#### Step 3. Assign it to a variable called drinks.

In [33]:
drinks = pd.read_csv('datasets/drinks.csv')
# drinks.head()

#### Step 4. Which continent drinks more beer on average?

List of all continents
```
Asia 
Africa 
Europe 
North America 
South America
Australia/Oceania 
Antarctica
```

#### Step 5. For each continent print the statistics for wine consumption.

#### Step 6. Print the mean alcohol consumption per continent for every column

#### Step 7. Print the median alcohol consumption per continent for every column

#### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

## Practice Exercise 02

#### Step 1. Import the necessary libraries

#### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/bsef19m521/DatasetsForProjects/master/occupation.user). 

#### Step 3. Assign it to a variable called users.

In [74]:
users = pd.read_csv('datasets/occupation.user', delimiter="|")
# users.head()

#### Step 4. Discover what is the mean age per occupation.

#### Step 5. Discover the Male ratio per occupation and sort it from the most to the least
1. Calculate the value count of each occupation.
2. Find the value count of gender `Male` for each occupation.
3. Compute the percentage and sort the result.

#### Step 6. For each occupation, calculate the minimum and maximum ages

#### Step 7. For each combination of occupation and gender, calculate the mean age

#### Step 8. For each occupation, present the percentage of women and men.

In [None]:
# count the rows for each (occupation, gender) pair
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})

# count the non-null values of every column for each occupation
occup_count = users.groupby(['occupation']).agg('count')

# divide gender_ocup by occup_count, matching on occupation, and multiply by 100
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100

# show the 'gender' column for all rows
occup_gender.loc[: , 'gender']
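An alternative sketch for the same percentages uses `pd.crosstab` with `normalize='index'`, which divides each row by its row total. The rows below are hypothetical and only mimic the `occupation` and `gender` columns of `users`.

```python
import pandas as pd

# Hypothetical rows, for illustration only
users_demo = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'doctor', 'doctor', 'artist', 'artist'],
    'gender': ['M', 'M', 'M', 'F', 'F', 'M'],
})

# One row per occupation, one column per gender, values as row percentages
share = pd.crosstab(
    users_demo['occupation'], users_demo['gender'], normalize='index'
) * 100
print(share)
```

This gives a wide table (one column per gender) instead of the long, two-level index produced by the cell above.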

# Pandas - Assignment no 09
- Here is the link to Pandas [Assignment no 09]()

### [Project : Exploring Ebay Car Sales Data](https://github.com/AnshuTrivedi/Data-Scientist-In-Python/blob/master/Projects/step_2/Course_1/Exploring%20Ebay%20Car%20Sales%20Data.ipynb)

**The aim of this project is to clean the data and analyze the included used car listings.**

### Introduction
- In this guided project, we'll work with a dataset of used cars from `eBay` Kleinanzeigen.
- The dataset was originally scraped and uploaded to Kaggle. We've made a few modifications from the original dataset that was uploaded to Kaggle:
    - We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment
    - We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)



In [1]:
# import libraries

import pandas as pd 
import numpy as np


In [2]:
# Load the dataset file.
# autos = pd.read_csv('autos.csv') fails with the default UTF-8 encoding; uncomment it to see the error:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 23: invalid continuation byte

In [5]:
pd.read_csv('datasets/autos.csv', encoding='Latin-1').head()

|  | dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | \$5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
| 1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | \$8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
| 2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | \$8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
| 3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | \$4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
| 4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | \$1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
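The encoding error mentioned above can be reproduced on a tiny in-memory file. This sketch assumes a hypothetical two-line CSV containing the letter `Ü`, which is the single byte `0xdc` in Latin-1 and is not valid on its own in UTF-8.

```python
import io
import pandas as pd

# Hypothetical two-line CSV containing a German umlaut, encoded as Latin-1
raw = 'name,price\nTÜV_neu,1350\n'.encode('latin-1')

try:
    pd.read_csv(io.BytesIO(raw))            # default encoding is UTF-8
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Telling pandas the actual encoding makes the same bytes readable
autos_demo = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
print(decoded_as_utf8, autos_demo.loc[0, 'name'])
```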


### Observation:

- The dataset contains 20 columns, most of which (15) are strings.
- Some columns have null values, but none of them has more than ~20% null values.
     - The lowest non-null count is about 40,000, in `notRepairedDamage`, so its share of null values is about (10,000 / 50,000) * 100 = 20%.
- The column names use `camelCase` instead of Python's preferred `snake_case`, which means we can't just replace spaces with underscores.
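One possible way to convert such names is a regular expression that inserts an underscore before each interior capital letter. This is a minimal sketch; the helper name `camel_to_snake` is hypothetical, and the project itself may rename the columns differently.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital, then lowercase."""
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

# A few of the column names shown in the table above
columns = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'nrOfPictures']
renamed = [camel_to_snake(c) for c in columns]
print(renamed)
```

With a dataframe, the same helper could be passed to `rename`, for example `autos.rename(columns=camel_to_snake)`, assuming `autos` holds the loaded data.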

