## Exercise 6.7
### 1: Going from wide to long

You can move multiple columns into a single column (making the data long and skinny) by "melting" multiple columns. In this exercise, you will practice doing this.

The `users` DataFrame has been pre-loaded for you. 

#### Instructions (1 point)

* Define a DataFrame `skinny` where you melt the `'visitors'` and `'staff'` columns of `users` into a single column.

In [1]:
import pandas as pd
users = pd.read_csv('https://github.com/huangpen77/BUDT704/raw/main/Chapter07/users.csv', index_col=0)
users

Unnamed: 0,weekday,city,visitors,staff
0,Sun,Austin,139,7
1,Sun,Dallas,237,12
2,Mon,Austin,326,3
3,Mon,Dallas,456,5


In [2]:
# Melt users: skinny
# Keep weekday and city as identifiers; melt visitors and staff into one column
skinny = users.melt(id_vars=['weekday','city'], value_vars=['visitors','staff'])
skinny

Unnamed: 0,weekday,city,variable,value
0,Sun,Austin,visitors,139
1,Sun,Dallas,visitors,237
2,Mon,Austin,visitors,326
3,Mon,Dallas,visitors,456
4,Sun,Austin,staff,7
5,Sun,Dallas,staff,12
6,Mon,Austin,staff,3
7,Mon,Dallas,staff,5


Because neither `var_name` nor `value_name` was specified, the melted DataFrame uses the default column names `variable` and `value`.
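
The effect of the two naming parameters can be seen side by side in a minimal sketch. The `toy` DataFrame below is made up for illustration and is not the exercise's `users` data.

```python
import pandas as pd

# Hypothetical toy data, not the exercise's users.csv
toy = pd.DataFrame({
    "weekday": ["Sun", "Mon"],
    "visitors": [139, 326],
    "staff": [7, 3],
})

# Default names: the melted columns are called 'variable' and 'value'
default_names = toy.melt(id_vars="weekday")

# Custom names supplied through var_name and value_name
custom_names = toy.melt(id_vars="weekday", var_name="user_type", value_name="count")

print(list(default_names.columns))
print(list(custom_names.columns))
```

When `value_vars` is omitted, as here, `melt` treats every column not listed in `id_vars` as a value column.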

#### Instructions (1 point)
This time, melt the `'visitors'` and `'staff'` columns again, but set `var_name` to `'user_type'` and `value_name` to `'count'`.

In [8]:
# Melt users: skinny
skinny = users.melt(id_vars=['weekday','city'], value_vars=['visitors','staff'], var_name = 'user_type', value_name='count')
skinny

Unnamed: 0,weekday,city,user_type,count
0,Sun,Austin,visitors,139
1,Sun,Dallas,visitors,237
2,Mon,Austin,visitors,326
3,Mon,Dallas,visitors,456
4,Sun,Austin,staff,7
5,Sun,Dallas,staff,12
6,Mon,Austin,staff,3
7,Mon,Dallas,staff,5


### 2: Going back to wide from long

#### Instructions (1 point)
Use the `.pivot()` method to convert the `skinny` DataFrame back to wide format. Use both `weekday` and `city` as the `index`, `user_type` as `columns`, and `count` as `values`.

In [9]:
df = skinny.pivot(index=['weekday','city'], columns='user_type', values='count')
df

Unnamed: 0_level_0,user_type,staff,visitors
weekday,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Mon,Austin,3,326
Mon,Dallas,5,456
Sun,Austin,7,139
Sun,Dallas,12,237


Note that `df` is not identical to `users`: its rows are indexed by a MultiIndex (`weekday`, `city`), and its column index carries the name `user_type`. We can remedy both.

#### Instructions (1 point)
- Set the `columns.name` property of `df` to an empty string.
- Reset the index of `df` by calling the `reset_index()` method.

In [10]:
df.columns.name = ""
df.reset_index()

Unnamed: 0,weekday,city,staff,visitors
0,Mon,Austin,3,326
1,Mon,Dallas,5,456
2,Sun,Austin,7,139
3,Sun,Dallas,12,237
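
Whether a melt followed by a pivot really recovers the original table can be checked programmatically. The sketch below uses a hypothetical `wide` DataFrame with the same shape as the table in this exercise (the names `wide`, `long_df`, `back`, and `same` are made up for illustration), and it realigns column and row order before comparing, because `pivot` sorts both.

```python
import pandas as pd

# Hypothetical toy data shaped like the users table above
wide = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "visitors": [139, 237, 326, 456],
    "staff": [7, 12, 3, 5],
})

long_df = wide.melt(id_vars=["weekday", "city"], var_name="user_type", value_name="count")

# pivot() with a list-valued index needs pandas 1.1 or newer
back = long_df.pivot(index=["weekday", "city"], columns="user_type", values="count")
back.columns.name = None      # drop the 'user_type' label on the column index
back = back.reset_index()     # turn the MultiIndex levels back into columns

# Align column order and row order before comparing with the original
back = back[wide.columns]
keys = ["weekday", "city"]
same = back.sort_values(keys).reset_index(drop=True).equals(
    wide.sort_values(keys).reset_index(drop=True)
)
print(same)
```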


## Exercise 6.8
### 1: Grouping by columns

In this exercise, you will work with the Titanic dataset and use `.groupby()` to analyze the distribution of passengers who boarded the Titanic.

The `'pclass'` column identifies which class of ticket was purchased by the passenger and the `'embarked'` column indicates at which of the three ports the passenger boarded the Titanic. `'S'` stands for Southampton, England, `'C'` for Cherbourg, France and `'Q'` for Queenstown, Ireland.

The DataFrame has been pre-loaded as `titanic`.

#### Titanic Data

In [11]:
import pandas as pd
titanic = pd.read_csv('https://github.com/huangpen77/BUDT704/raw/main/Chapter07/titanic.csv', index_col='name')
titanic.head(3)

Unnamed: 0_level_0,pclass,survived,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
"Allen, Miss. Elisabeth Walton",1,1,female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
"Allison, Master. Hudson Trevor",1,1,male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
"Allison, Miss. Helen Loraine",1,0,female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [12]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1309 entries, Allen, Miss. Elisabeth Walton to Zimmerman, Mr. Leo
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   sex        1309 non-null   object 
 3   age        1046 non-null   float64
 4   sibsp      1309 non-null   int64  
 5   parch      1309 non-null   int64  
 6   ticket     1309 non-null   object 
 7   fare       1308 non-null   float64
 8   cabin      295 non-null    object 
 9   embarked   1307 non-null   object 
 10  boat       486 non-null    object 
 11  body       121 non-null    float64
 12  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(6)
memory usage: 143.2+ KB


#### Instructions (2 points)

* Group by the `'pclass'` column and then aggregate the `'ticket'` column using `.count()`. Save the result as `passengers_by_class`. Print it out.

In [13]:
# Group by the `'pclass'` column and then aggregate the `'ticket'` column using `.count()`.
passengers_by_class = titanic.groupby('pclass')['ticket'].count()
passengers_by_class

pclass
1    323
2    277
3    709
Name: ticket, dtype: int64

#### Instructions (2 points)
- Group by the `'pclass'` column and then aggregate the `'survived'` column using `.sum()`. Save the result as `survived_by_class`. Print it out.

In [14]:
# Group by the `'pclass'` column and then aggregate the `'survived'` column using `.sum()`
survived_by_class = titanic.groupby('pclass')['survived'].sum()
survived_by_class

pclass
1    200
2    119
3    181
Name: survived, dtype: int64

#### Instructions (2 points)
- We can use the two pandas Series, `passengers_by_class` and `survived_by_class`, to calculate the survival rate of each passenger class. Divide `survived_by_class` by `passengers_by_class`, and store the result in `prob_by_class`. Print it out to see the survival rates.

In [15]:
# Divide `survived_by_class` by `passengers_by_class`, and store the result in `prob_by_class`
prob_by_class = survived_by_class/passengers_by_class
prob_by_class

pclass
1    0.619195
2    0.429603
3    0.255289
dtype: float64
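
Because `survived` is coded as 0 or 1, the same rate can also be obtained in one step as the group mean. The sketch below illustrates this on a hypothetical `toy` DataFrame rather than the real Titanic file; the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic file
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("pclass")["survived"]

# Two-step version, as in the exercise: survivors divided by passengers
rate_two_step = grouped.sum() / grouped.count()

# One-step version: the mean of a 0/1 column is the share of ones
rate_mean = grouped.mean()

print(rate_mean)
```

Note that `.count()` skips missing values. Counting `ticket` works in the exercise because that column has no nulls, whereas counting a column with gaps, such as `age`, would understate the number of passengers.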

### 2: Grouping by multiple columns
In this exercise, you will practice grouping by multiple columns.
#### Instructions (2 points)
- Group `titanic` by `'embarked'` and `'survived'`, and aggregate the `'ticket'` column using `.count()`. Store the result in `count_mult`.

In [16]:
# Group titanic by `'embarked'` and `'survived'`, and aggregate the `ticket` column by `count()`.
count_mult = titanic.groupby(['embarked','survived'])['ticket'].count()
count_mult

embarked  survived
C         0           120
          1           150
Q         0            79
          1            44
S         0           610
          1           304
Name: ticket, dtype: int64

#### Instructions (2 points)
- As the previous output shows, `count_mult` is a pandas Series. We can inspect its index through its `index` property.

In [17]:
# use the `index` property to find out the index of `count_mult`
count_mult.index

MultiIndex([('C', 0),
            ('C', 1),
            ('Q', 0),
            ('Q', 1),
            ('S', 0),
            ('S', 1)],
           names=['embarked', 'survived'])

#### Instructions (2 points)
- Now we can use the index to retrieve the values in `count_mult`. For example, `count_mult[('C', 1)]` tells you how many passengers who embarked at Cherbourg, France survived, and `count_mult[('C', 0)]` tells you how many passengers who embarked at Cherbourg, France died. Using this information, write the code to calculate the survival rate of passengers who embarked at Cherbourg, France. Store the result in `prob_C`.

In [42]:
# calculate the survival rate of passengers departing from Cherbourg, France, and store the result in `prob_C`.
# survival rate = survived / (survived + deceased)
prob_C = count_mult[('C', 1)]/(count_mult[('C', 1)] + count_mult[('C', 0)])

print('The survival rate of passengers departing from Cherbourg, France is {:.2f} percent.'.format(prob_C * 100))

The survival rate of passengers departing from Cherbourg, France is 55.56 percent.
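
The same calculation can be done for every port at once by moving the `survived` level of the MultiIndex into columns. This is a sketch on hypothetical counts (the `counts` Series and its numbers are made up), assuming a Series with the same `(embarked, survived)` index layout as `count_mult`.

```python
import pandas as pd

# Hypothetical counts with the same (embarked, survived) index layout as count_mult
idx = pd.MultiIndex.from_tuples(
    [("C", 0), ("C", 1), ("Q", 0), ("Q", 1)], names=["embarked", "survived"]
)
counts = pd.Series([2, 3, 6, 2], index=idx)

# Move the 'survived' level into columns: one row per port, columns 0 and 1
table = counts.unstack("survived")

# Survival rate per port = survived / (survived + deceased)
rates = table[1] / (table[0] + table[1])
print(rates)
```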


## Exercise 6.9
### 1: Transformations with .apply()

The `.apply()` method, when used on a groupby object, calls a function you supply on each group separately. These functions can be aggregations, transformations, or more complex workflows. `.apply()` then combines the per-group results into a single Series or DataFrame.
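
As a minimal sketch of the difference between an aggregation and a transformation under `.apply()`, consider the hypothetical `toy` data below (the names and numbers are made up for illustration). Passing `group_keys=False` is a choice made here so that the row-wise result keeps the original row labels.

```python
import pandas as pd

# Hypothetical toy data: two regions, five countries
toy = pd.DataFrame(
    {"region": ["A", "A", "A", "B", "B"], "gdp": [1.0, 2.0, 3.0, 10.0, 30.0]},
    index=["a1", "a2", "a3", "b1", "b2"],
)

# group_keys=False: do not prepend the region to the index of row-wise results
groups = toy.groupby("region", group_keys=False)["gdp"]

# Aggregation: the function returns one number per group
spread = groups.apply(lambda s: s.max() - s.min())

# Transformation: the function returns one number per row (z-score within the group)
zscore = groups.apply(lambda s: (s - s.mean()) / s.std())

print(spread)
print(zscore)
```

The aggregation result is indexed by region, while the transformation result is indexed by the original row labels.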

In this exercise, you're going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you'll define a function to compute the aggregate spread of per capita GDP in each region and the individual country's z-score of the regional per capita GDP. You'll then select six countries - United States, Mexico, United Kingdom, Poland, China and Japan - to see a summary of the regional GDP and that country's z-score against the regional mean.

The 2010 Gapminder DataFrame is provided for you as `gapminder_2010`. Pandas has been imported as `pd`.

The function `disparity()` has been defined for your use.

#### Instructions (4 points)

* Group `gapminder_2010` by `'region'`. Save the result as `regional`.
* Apply the provided `disparity` function on `regional`, and save the result as `reg_disp`.
* Use `.loc[]` to select `['United States', 'Mexico', 'United Kingdom', 'Poland', 'China', 'Japan']` from `reg_disp` and print the results.

In [24]:
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the columns 'regional standardized(gdp)' (z), 'regional spread(gdp)' (s) and 'region'
    return pd.DataFrame({'regional standardized(gdp)':z , 'regional spread(gdp)':s, 'region':gr.region})

import pandas as pd
gapminder = pd.read_csv('https://github.com/huangpen77/BUDT704/raw/main/Chapter07/gapminder_tidy.csv', index_col='Country')
gapminder_mask = gapminder['Year'] == 2010
gapminder_2010 = gapminder[gapminder_mask].copy()
gapminder_2010.drop('Year', axis=1, inplace=True)
gapminder_2010.head()

Unnamed: 0_level_0,fertility,life,population,child_mortality,gdp,region
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Afghanistan,5.659,59.612,31411743.0,105.0,1637.0,South Asia
Albania,1.741,76.78,3204284.0,16.6,9374.0,Europe & Central Asia
Algeria,2.817,70.615,35468208.0,27.4,12494.0,Middle East & North Africa
Angola,6.218,50.689,19081912.0,182.5,7047.0,Sub-Saharan Africa
Antigua and Barbuda,2.13,75.437,88710.0,9.9,20567.0,America


In [37]:
# Group gapminder_2010 by 'region': regional
# group_keys=False keeps 'Country' as the only index level of the result
regional = gapminder_2010.groupby('region', group_keys=False)

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of United States, Mexico, United Kingdom, Poland, China and Japan
reg_disp.loc[['United States', 'Mexico', 'United Kingdom', 'Poland', 'China', 'Japan']]


Unnamed: 0_level_0,regional standardized(gdp),regional spread(gdp),region
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
United States,3.013374,47855.0,America
Mexico,-0.026505,47855.0,America
United Kingdom,0.572873,89037.0,Europe & Central Asia
Poland,-0.292486,89037.0,Europe & Central Asia
China,-0.432756,96993.0,East Asia & Pacific
Japan,0.514261,96993.0,East Asia & Pacific
