# Exercise 1: Data cleaning

Before doing actual data analysis, we usually first need to clean the data. 
This might involve steps such as dealing with missing values and encoding categorical variables as integers.
In this exercise, you will perform such steps based on the Titanic passenger data.

1. Load the Titanic data set in `titanic.csv` located in the `data/` folder.
2. Report the number of observations with missing `Age`, for example using [`isna()`](https://pandas.pydata.org/docs/reference/api/pandas.isna.html).
3. Compute the average age in the data set. Use the following approaches and compare your results:
    1.  Use pandas's [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) method.
    2.  Convert the `Age` column to a NumPy array using [`to_numpy()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html). Experiment with NumPy's [`np.mean()`](https://numpy.org/doc/2.0/reference/generated/numpy.mean.html) and [`np.nanmean()`](https://numpy.org/doc/2.0/reference/generated/numpy.nanmean.html) to see if you obtain the same results.
4. Replace all missing ages with the mean age you computed above, rounded to the nearest integer.
   Note that in "real" applications, replacing missing values with sample means is usually not a good idea.
5. Convert this updated `Age` column to integer type using [`astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html).
6. Generate a new column `Female` which takes on the value one if `Sex` is equal to `"female"` and zero otherwise. 
   This is called an _indicator_ or _dummy_ variable, and is preferable to storing such categorical data as strings.
   Delete the original column `Sex`.
7. Save your cleaned data set as `titanic-clean.csv` using [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) with `,` as the field separator.
   Tell `to_csv()` to *not* write the `DataFrame` index to the CSV file as it's not needed in this example.

In [2]:
import pandas as pd
import numpy as np
DATA_PATH = '../../data'
fn=f'{DATA_PATH}/titanic.csv'
df=pd.read_csv(fn)

In [40]:
df['Age'].isna().sum()

np.int64(177)

In [36]:
avgage=df["Age"].mean()

In [37]:
ar=df["Age"].to_numpy()
np.nanmean(ar)


np.float64(29.69911764705882)
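
The cell above only tries `np.nanmean()`. As a minimal sketch of the comparison asked for in step 3, the following uses a small made-up series (the values are an assumption for illustration, not taken from `titanic.csv`) to show how the three approaches treat missing values.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value (illustration only)
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

pandas_mean = ages.mean()          # pandas skips NaN by default (skipna=True)
arr = ages.to_numpy()
numpy_mean = np.mean(arr)          # NaN propagates, so the result is NaN
numpy_nanmean = np.nanmean(arr)    # ignores NaN, agrees with the pandas result

print(pandas_mean, numpy_mean, numpy_nanmean)
```

On data with missing values, `np.mean()` is therefore the odd one out, while `mean()` on the pandas column and `np.nanmean()` should agree.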

In [None]:
# Fill missing ages with the mean age, rounded to the nearest integer (step 4)
df['Age']=df['Age'].fillna(round(avgage))

In [31]:
df['Age']=df['Age'].astype(int)
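
The order of steps 4 and 5 matters: an integer column cannot hold `NaN`, so the missing values have to be filled before the cast. A small sketch with made-up ages (an assumption for illustration) shows both the failing and the working order.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration (not taken from titanic.csv)
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

# Casting while NaN is still present fails, since NaN has no integer representation
try:
    toy["Age"].astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# Fill with the rounded mean first, then cast
fill_value = round(toy["Age"].mean())      # mean is about 27.67, which rounds to 28
toy["Age"] = toy["Age"].fillna(fill_value).astype(int)

print(cast_failed, toy["Age"].tolist())
```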

In [45]:
df["Female"]=(df["Sex"]=="female").astype(int)
df

,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Female
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S,0
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S,1
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S,1
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C,0
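
To make the indicator construction explicit, here is a minimal sketch on a made-up `Sex` column (the values are an assumption for illustration). The comparison yields a boolean series, and `astype(int)` maps `True` to 1 and `False` to 0.

```python
import pandas as pd

# Made-up values for illustration
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

is_female = sex == "female"        # boolean Series
female = is_female.astype(int)     # True -> 1, False -> 0

print(female.tolist())
print(female.mean())               # share of women in the toy sample
```

One convenient property of a 0/1 indicator is that its mean is the sample share of the category.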


In [58]:
# Delete the original column Sex, which is no longer needed
df=df.drop(columns='Sex')

In [4]:
df.to_csv('titanic-clean.csv',sep=',',index=False)
df

,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C
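
One way to check what `to_csv()` writes is a round trip through an in-memory buffer. This is only a sketch: the toy `DataFrame` and the use of `io.StringIO` instead of a file on disk are assumptions made to keep the example self-contained.

```python
import io
import pandas as pd

# Made-up cleaned data for illustration
toy = pd.DataFrame({"PassengerId": [1, 2], "Age": [22, 38], "Female": [0, 1]})

buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)   # index=False: no extra index column
csv_text = buffer.getvalue()

buffer.seek(0)
reloaded = pd.read_csv(buffer)             # read the same text back

print(csv_text.splitlines()[0])            # header row only contains the data columns
```

With `index=False`, the header row contains only the data columns, so reading the file back does not produce an extra `Unnamed: 0` column.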


***
# Exercise 2: Selecting subsets of data

In this exercise, you are asked to select subsets of macroeconomic data for the United States based on some criteria.

1.  Load the annual data from FRED which are located in `FRED_annual.xlsx` 
    in the `data/FRED` folder.

2.  Print the list of columns and the number of non-missing observations.

3.  Since we are dealing with time series data, set the column `Year` as the DataFrame index.

4.  Print all observations for the 1960s decade using at least two different methods.

5.  Using the data in the column `GDP`, compute the annual GDP growth in percent and store it in the column `GDP_growth`. Select the years in which

    1.  GDP growth was above 5%.
    2.  GDP growth was negative, but inflation was still above 5% (such episodes are called "stagflation"; usually, negative GDP growth is associated with low inflation).
        
        Use at least two methods to select such years.

    *Hint:* You can compute changes relative to the previous observation using the 
    [`pct_change()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pct_change.html) method.


In [75]:
fn=f'{DATA_PATH}/FRED/FRED_annual.xlsx'
df=pd.read_excel(fn)

In [76]:
print(df.columns)
print(df.count()) 

Index(['Year', 'GDP', 'CPI', 'UNRATE', 'FEDFUNDS', 'INFLATION'], dtype='object')
Year         70
GDP          70
CPI          70
UNRATE       70
FEDFUNDS     70
INFLATION    69
dtype: int64


In [77]:
df.set_index('Year', inplace=True)
df

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION
1954,2877.7,26.9,5.6,1.0,
1955,3083.0,26.8,4.4,1.8,-0.371747
1956,3148.8,27.2,4.1,2.7,1.492537
1957,3215.1,28.1,4.3,3.1,3.308824
1958,3191.2,28.9,6.8,1.6,2.846975
...,...,...,...,...,...
2019,20715.7,255.7,3.7,2.2,1.831939
2020,20267.6,258.8,8.1,0.4,1.212358
2021,21494.8,271.0,5.4,0.1,4.714065
2022,22034.8,292.6,3.6,1.7,7.970480


In [80]:
df.loc['1960':'1969']

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION
1960,3500.3,29.6,5.5,3.2,1.369863
1961,3590.1,29.9,6.7,2.0,1.013514
1962,3810.1,30.3,5.6,2.7,1.337793
1963,3976.1,30.6,5.6,3.2,0.990099
1964,4205.3,31.0,5.2,3.5,1.30719
1965,4478.6,31.5,4.5,4.1,1.612903
1966,4773.9,32.5,3.8,5.1,3.174603
1967,4904.9,33.4,3.8,4.2,2.769231
1968,5145.9,34.8,3.6,5.7,4.191617
1969,5306.6,36.7,3.5,8.2,5.45977


In [82]:
df.iloc[6:16]

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION
1960,3500.3,29.6,5.5,3.2,1.369863
1961,3590.1,29.9,6.7,2.0,1.013514
1962,3810.1,30.3,5.6,2.7,1.337793
1963,3976.1,30.6,5.6,3.2,0.990099
1964,4205.3,31.0,5.2,3.5,1.30719
1965,4478.6,31.5,4.5,4.1,1.612903
1966,4773.9,32.5,3.8,5.1,3.174603
1967,4904.9,33.4,3.8,4.2,2.769231
1968,5145.9,34.8,3.6,5.7,4.191617
1969,5306.6,36.7,3.5,8.2,5.45977
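
Positional selection with `iloc` depends on knowing where the decade starts in the data. As a sketch of further options, the following toy example (made-up values, with an integer `Year` index as an assumption) selects the 1960s by label, by a boolean mask on the index, and with `query()`.

```python
import pandas as pd

# Made-up data with an integer Year index (illustration only)
toy = pd.DataFrame(
    {"Year": list(range(1958, 1972)), "GDP": list(range(100, 114))}
).set_index("Year")

by_label = toy.loc[1960:1969]                                # label slice, both ends included
by_mask = toy[(toy.index >= 1960) & (toy.index <= 1969)]     # boolean mask on the index
by_query = toy.query("Year >= 1960 and Year <= 1969")        # index name used inside query()

print(len(by_label), len(by_mask), len(by_query))
```

Unlike positional slices, label slices with `loc` include the end point, which is why `1960:1969` covers all ten years.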


In [84]:
df["GDP_growth"]=(df['GDP']-df['GDP'].shift(1))/df['GDP'].shift(1)*100
df

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION,GDP_growth
1954,2877.7,26.9,5.6,1.0,,
1955,3083.0,26.8,4.4,1.8,-0.371747,7.134170
1956,3148.8,27.2,4.1,2.7,1.492537,2.134285
1957,3215.1,28.1,4.3,3.1,3.308824,2.105564
1958,3191.2,28.9,6.8,1.6,2.846975,-0.743367
...,...,...,...,...,...,...
2019,20715.7,255.7,3.7,2.2,1.831939,2.583949
2020,20267.6,258.8,8.1,0.4,1.212358,-2.163094
2021,21494.8,271.0,5.4,0.1,4.714065,6.054984
2022,22034.8,292.6,3.6,1.7,7.970480,2.512236
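
The growth rate above is computed by hand with `shift()`. The hint in the exercise suggests `pct_change()`; a short sketch on made-up numbers (an assumption for illustration) indicates that the two approaches coincide once the result is scaled to percent.

```python
import pandas as pd

# Made-up GDP levels for illustration
gdp = pd.Series([100.0, 105.0, 102.9], index=[2000, 2001, 2002], name="GDP")

manual = (gdp - gdp.shift(1)) / gdp.shift(1) * 100   # explicit formula
builtin = gdp.pct_change() * 100                     # pct_change() returns a fraction, not percent

print(manual.round(2).tolist())
print(builtin.round(2).tolist())
```

In both cases the first observation is missing because there is no previous year to compare against.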


In [85]:
df[df['GDP_growth']>5]

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION,GDP_growth
1955,3083.0,26.8,4.4,1.8,-0.371747,7.13417
1959,3412.4,29.2,5.4,3.3,1.038062,6.931562
1962,3810.1,30.3,5.6,2.7,1.337793,6.127963
1964,4205.3,31.0,5.2,3.5,1.30719,5.764443
1965,4478.6,31.5,4.5,4.1,1.612903,6.498942
1966,4773.9,32.5,3.8,5.1,3.174603,6.593578
1972,5780.0,41.8,5.6,4.4,3.209877,5.25549
1973,6106.4,44.4,4.9,8.7,6.220096,5.647059
1976,6387.4,56.9,7.7,5.0,5.762082,5.386989
1978,7052.7,65.2,6.1,7.9,7.590759,5.535105


In [87]:
df.query('GDP_growth>5')

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION,GDP_growth
1955,3083.0,26.8,4.4,1.8,-0.371747,7.13417
1959,3412.4,29.2,5.4,3.3,1.038062,6.931562
1962,3810.1,30.3,5.6,2.7,1.337793,6.127963
1964,4205.3,31.0,5.2,3.5,1.30719,5.764443
1965,4478.6,31.5,4.5,4.1,1.612903,6.498942
1966,4773.9,32.5,3.8,5.1,3.174603,6.593578
1972,5780.0,41.8,5.6,4.4,3.209877,5.25549
1973,6106.4,44.4,4.9,8.7,6.220096,5.647059
1976,6387.4,56.9,7.7,5.0,5.762082,5.386989
1978,7052.7,65.2,6.1,7.9,7.590759,5.535105


In [89]:
df[(df["GDP_growth"]<0) & (df['INFLATION']>5)]

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION,GDP_growth
1974,6073.4,49.3,5.6,10.5,11.036036,-0.540417
1975,6060.9,53.8,8.5,5.8,9.127789,-0.205816
1980,7257.3,82.4,7.2,13.4,13.498623,-0.257009
1982,7307.3,96.5,9.7,12.3,6.160616,-1.8034


In [90]:
df.query('GDP_growth<0 & INFLATION>5')

Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION,GDP_growth
1974,6073.4,49.3,5.6,10.5,11.036036,-0.540417
1975,6060.9,53.8,8.5,5.8,9.127789,-0.205816
1980,7257.3,82.4,7.2,13.4,13.498623,-0.257009
1982,7307.3,96.5,9.7,12.3,6.160616,-1.8034
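
When combining conditions in a boolean mask, each comparison needs its own parentheses because `&` binds more tightly than `<` and `>`. The sketch below uses made-up numbers (not FRED data) to show the mask and the `query()` variant side by side.

```python
import pandas as pd

# Made-up growth and inflation rates for illustration (not FRED data)
toy = pd.DataFrame(
    {"GDP_growth": [2.5, -0.5, -1.0, 6.0], "INFLATION": [3.0, 11.0, 2.0, 7.0]},
    index=pd.Index([2001, 2002, 2003, 2004], name="Year"),
)

# Parentheses around each comparison are required when using &
mask = (toy["GDP_growth"] < 0) & (toy["INFLATION"] > 5)
by_mask = toy[mask]

# Inside query(), the keyword `and` can be used and no parentheses are needed
by_query = toy.query("GDP_growth < 0 and INFLATION > 5")

print(by_mask.index.tolist())
```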


***
# Exercise 3: Labor market statistics for the US

In this exercise, you are asked to compute some descriptive statistics for the unemployment rate 
and the labor force participation (the fraction of the working-age population in the labor force, i.e., individuals who are either employed or unemployed) for the United States.

1.  Load the monthly time series from FRED which are located in `FRED_monthly.csv` 
    in the `data/FRED` folder.

    *Hint:* You can use `pd.read_csv(..., parse_dates=['DATE'])` to automatically
    parse strings stored in the `DATE` column as dates.

2.  Print the list of columns and the number of non-missing observations.

3.  Since we are dealing with time series data, set the column `DATE` as the DataFrame index. Using the date index, select all observations from the first three months
    of the year 2020.

4.  For the columns `UNRATE` (unemployment rate) and `LFPART` (labor force participation), compute and report 
    the mean, minimum and maximum values for the whole sample. Round your results 
    to one decimal digit.

    *Hint:* You can use the DataFrame methods 
    [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html),
    [`min()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html), and 
    [`max()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html)
    to compute the desired statistics.

    *Hint:* You can use the DataFrame method
    [`round()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html)
    to round the results to a given number of decimal digits.

5.  You are interested in how the average unemployment rate evolved over the last 
    few decades. 

    -   Add a new column `Decade` to the DataFrame which contains the 
        starting year for each decade (e.g., this value should be 1950
        for the years 1950-1959, and so on).

        *Hint:* The decade can be computed from the column `Year` using 
        truncated integer division:
        ```python
        df['Year'] // 10 * 10
        ```

    -   Write a loop to compute and report the average unemployment rate (column `UNRATE`) 
        for each decade.

        Include only the decades from 1950 to 2010 for which you have all
        observations. 

In [104]:
fn=f'{DATA_PATH}/FRED/FRED_monthly.csv'
df=pd.read_csv(fn,parse_dates=['DATE'])

In [105]:
print(df.columns)
print(df.count()) 

Index(['DATE', 'Year', 'Month', 'CPI', 'UNRATE', 'FEDFUNDS', 'REALRATE',
       'LFPART'],
      dtype='object')
DATE        924
Year        924
Month       924
CPI         924
UNRATE      924
FEDFUNDS    846
REALRATE    516
LFPART      924
dtype: int64


In [106]:
df.set_index('DATE',inplace=True)

In [113]:
df.loc['2020-01-01':'2020-03-31']

DATE,Year,Month,CPI,UNRATE,FEDFUNDS,REALRATE,LFPART
2020-01-01,2020,1,259.1,3.6,1.6,-0.6,63.3
2020-02-01,2020,2,259.2,3.5,1.6,-0.5,63.3
2020-03-01,2020,3,258.1,4.4,0.6,3.4,62.6
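
With a date index, pandas also accepts partial date strings, so the exact first and last day of the period need not be spelled out. A minimal sketch on a made-up monthly series (the values and the date range are assumptions for illustration):

```python
import pandas as pd

# Made-up monthly data from Nov 2019 to Apr 2020 (illustration only)
dates = pd.date_range("2019-11-01", periods=6, freq="MS")
toy = pd.DataFrame({"UNRATE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

# Partial strings: the slice covers all of January through all of March
first_quarter = toy.loc["2020-01":"2020-03"]

# Equivalent boolean mask using the year and month attributes of the index
by_mask = toy[(toy.index.year == 2020) & (toy.index.month <= 3)]

print(first_quarter.index.strftime("%Y-%m").tolist())
```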


In [116]:
print(round(df["UNRATE"].mean(),1))
print(round(df["UNRATE"].min(),1))
print(round(df["UNRATE"].max(),1))
print(round(df["LFPART"].mean(),1))
print(round(df["LFPART"].min(),1))
print(round(df["LFPART"].max(),1))

5.7
2.5
14.8
62.8
58.1
67.3
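
The six `print()` calls above can be condensed. As a sketch (using a made-up three-row `DataFrame` as an assumption), `agg()` computes several statistics for several columns at once and returns a table that can be rounded in one step.

```python
import pandas as pd

# Made-up values for illustration (not FRED data)
toy = pd.DataFrame({"UNRATE": [3.5, 4.44, 14.8], "LFPART": [63.3, 62.6, 60.12]})

# One row per statistic, one column per variable
stats = toy[["UNRATE", "LFPART"]].agg(["mean", "min", "max"]).round(1)

print(stats)
```

The result is itself a `DataFrame`, so individual values can be looked up with `stats.loc['mean', 'UNRATE']`.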


In [None]:
df['Decade']=df['Year']//10*10

In [125]:
for i in range(1950,2020,10):
    st=f"{i}-01-01"
    en=f"{i+9}-12-31"
    avg=round(df.loc[st:en,"UNRATE"].mean(),1)
    print(f'{i}:{avg}')

1950:4.5
1960:4.8
1970:6.2
1980:7.3
1990:5.8
2000:5.5
2010:6.2
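
The loop above selects each decade by date range and does not use the `Decade` column. A possible variant, sketched here on made-up data (an assumption for illustration), loops over the values in `Decade` and compares the result to `groupby()`, which is not required by the exercise but computes the same averages without a loop.

```python
import pandas as pd

# Made-up observations for illustration (not FRED data)
toy = pd.DataFrame({
    "Year": [1950, 1955, 1959, 1960, 1969],
    "UNRATE": [5.0, 4.0, 6.0, 5.5, 3.5],
})
toy["Decade"] = toy["Year"] // 10 * 10

# Loop over the distinct decades and average UNRATE within each
loop_means = {}
for decade in sorted(toy["Decade"].unique()):
    in_decade = toy["Decade"] == decade
    loop_means[int(decade)] = round(toy.loc[in_decade, "UNRATE"].mean(), 1)

# Same computation with groupby()
grouped = toy.groupby("Decade")["UNRATE"].mean().round(1)

print(loop_means)
```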


***
# Exercise 4: Working with string data (advanced)

Most of the data we deal with contain strings, i.e., text data (names, addresses, etc.). Often, such data is not in the format needed for analysis, and we have to perform additional string manipulation to extract the exact data we need. This can be achieved using the pandas [string methods](https://pandas.pydata.org/docs/user_guide/text.html#string-methods).

To illustrate, we use the Titanic data set for this exercise.

1.  Load the Titanic data and restrict the sample to men. (This simplifies the task. Women in this data set have much more complicated names as they contain both their husband's and their maiden name.)
2.  Print the first five observations of the `Name` column. As you can see, the data is stored in the format _"Last name, Title First name"_ where title is something like Mr., Rev., etc.
3. Split the `Name` column by `,` to extract the last name and the remainder as separate columns. You can achieve this using the [`partition()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.partition.html#pandas.Series.str.partition) string method.
4. Split the remainder (containing the title and first name) using the space character `" "` as separator to obtain individual columns for the title and the first name.
5. Store the three data series in the original `DataFrame` (using the column names `FirstName`, `LastName` and `Title`) and delete the `Name` column which is no longer needed.
6. Finally, extract the ship deck from the values in `Cabin`. The ship deck is the first character in the string stored in `Cabin` (A, B, C, ...). You can extract the first character using the 
[`get()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.get.html#pandas.Series.str.get) string method. Store the result in the column `Deck`.

*Hint*: Pandas's string methods can be accessed using the `.str` attribute. For example, to partition values in the column `Name`, you need to use
```python
df['Name'].str.partition()
```


In [26]:
import pandas as pd
import numpy as np
DATA_PATH = '../../data'
fn=f'{DATA_PATH}/titanic.csv'
df=pd.read_csv(fn)

In [27]:
# Keep only men; copy() avoids pandas warnings when adding columns to the subset later
df=df[df['Sex']=='male'].copy()
df['Name'].iloc[0:5]

0          Braund, Mr. Owen Harris
4         Allen, Mr. William Henry
5                 Moran, Mr. James
6          McCarthy, Mr. Timothy J
7    Palsson, Master Gosta Leonard
Name: Name, dtype: object

In [28]:
df[["Last Name","Sep","Rest"]]=df['Name'].str.partition(',')
df.set_index('Name', inplace=True)
df.reset_index(drop=True)

,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Last Name,Sep,Rest
0,1,0,3,male,22.0,A/5 21171,7.2500,,S,Braund,",",Mr. Owen Harris
1,5,0,3,male,35.0,373450,8.0500,,S,Allen,",",Mr. William Henry
2,6,0,3,male,,330877,8.4583,,Q,Moran,",",Mr. James
3,7,0,1,male,54.0,17463,51.8625,E46,S,McCarthy,",",Mr. Timothy J
4,8,0,3,male,2.0,349909,21.0750,,S,Palsson,",",Master Gosta Leonard
...,...,...,...,...,...,...,...,...,...,...,...,...
572,884,0,2,male,28.0,C.A./SOTON 34068,10.5000,,S,Banfield,",",Mr. Frederick James
573,885,0,3,male,25.0,SOTON/OQ 392076,7.0500,,S,Sutehall,",",Mr. Henry Jr
574,887,0,2,male,27.0,211536,13.0000,,S,Montvila,",",Rev. Juozas
575,890,1,1,male,26.0,111369,30.0000,C148,C,Behr,",",Mr. Karl Howell


In [30]:
df['Rest']=df['Rest'].str.strip()
df[["Title","Sep","First Name"]]=df["Rest"].str.partition(' ')
df.set_index('Rest',inplace=True)
df.reset_index(drop=True)

,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Last Name,Sep,Title,First Name
0,1,0,3,male,22.0,A/5 21171,7.2500,,S,Braund,,Mr.,Owen Harris
1,5,0,3,male,35.0,373450,8.0500,,S,Allen,,Mr.,William Henry
2,6,0,3,male,,330877,8.4583,,Q,Moran,,Mr.,James
3,7,0,1,male,54.0,17463,51.8625,E46,S,McCarthy,,Mr.,Timothy J
4,8,0,3,male,2.0,349909,21.0750,,S,Palsson,,Master,Gosta Leonard
...,...,...,...,...,...,...,...,...,...,...,...,...,...
572,884,0,2,male,28.0,C.A./SOTON 34068,10.5000,,S,Banfield,,Mr.,Frederick James
573,885,0,3,male,25.0,SOTON/OQ 392076,7.0500,,S,Sutehall,,Mr.,Henry Jr
574,887,0,2,male,27.0,211536,13.0000,,S,Montvila,,Rev.,Juozas
575,890,1,1,male,26.0,111369,30.0000,C148,C,Behr,,Mr.,Karl Howell
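
The exercise asks for the column names `FirstName`, `LastName` and `Title`, and for `Name` to be deleted. A compact sketch of steps 3 to 5 on made-up names (the names are an assumption for illustration) could look as follows; it keeps the partition results in temporary variables, so no helper columns need to be removed afterwards.

```python
import pandas as pd

# Made-up names in the "Last name, Title First name" format
toy = pd.DataFrame({"Name": ["Doe, Mr. John Adam", "Roe, Rev. Richard"]})

# partition() returns three columns labelled 0, 1, 2: before, separator, after
parts = toy["Name"].str.partition(",")
rest = parts[2].str.strip()                 # remove the space following the comma
title_first = rest.str.partition(" ")       # split at the first space only

toy["LastName"] = parts[0]
toy["Title"] = title_first[0]
toy["FirstName"] = title_first[2]

# Name is no longer needed
toy = toy.drop(columns=["Name"])

print(toy)
```

Because `partition()` splits at the first occurrence of the separator only, first names that contain spaces stay in one piece.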


In [33]:
df['Deck']=df['Cabin'].str[0]
df

Rest,PassengerId,Survived,Pclass,Sex,Age,Ticket,Fare,Cabin,Embarked,Last Name,Sep,Title,First Name,Deck
Mr. Owen Harris,1,0,3,male,22.0,A/5 21171,7.2500,,S,Braund,,Mr.,Owen Harris,
Mr. William Henry,5,0,3,male,35.0,373450,8.0500,,S,Allen,,Mr.,William Henry,
Mr. James,6,0,3,male,,330877,8.4583,,Q,Moran,,Mr.,James,
Mr. Timothy J,7,0,1,male,54.0,17463,51.8625,E46,S,McCarthy,,Mr.,Timothy J,E
Master Gosta Leonard,8,0,3,male,2.0,349909,21.0750,,S,Palsson,,Master,Gosta Leonard,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Mr. Frederick James,884,0,2,male,28.0,C.A./SOTON 34068,10.5000,,S,Banfield,,Mr.,Frederick James,
Mr. Henry Jr,885,0,3,male,25.0,SOTON/OQ 392076,7.0500,,S,Sutehall,,Mr.,Henry Jr,
Rev. Juozas,887,0,2,male,27.0,211536,13.0000,,S,Montvila,,Rev.,Juozas,
Mr. Karl Howell,890,1,1,male,26.0,111369,30.0000,C148,C,Behr,,Mr.,Karl Howell,C
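
The cell above uses the indexing shorthand `.str[0]` rather than the `get()` method mentioned in the exercise. A minimal sketch on made-up cabin values (an assumption for illustration) suggests that both give the same result and that missing cabins simply stay missing.

```python
import numpy as np
import pandas as pd

# Made-up cabin values, including a missing one
toy = pd.DataFrame({"Cabin": ["E46", np.nan, "C148"]})

toy["Deck"] = toy["Cabin"].str.get(0)              # first character of each string
same = toy["Deck"].equals(toy["Cabin"].str[0])     # indexing shorthand for comparison

print(toy)
print(same)
```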
