# Data Cleaning
---


**Objective:** collect the following monthly series for all countries, covering December 1959 to December 1990:


1. Industrial production (Index)

2. Exchange rates, National Currency per US dollar (Period Average)

3. Consumer prices (All items), index

4. International Reserves and Liquidity (Reserves, Official Reserve Assets, US Dollar)

5. Data for consumer prices and international reserves for the United States only over the same time period.


---

# 1. Downloading the data

We collected the data from the [IMF data portal](https://data.imf.org/?sk=4c514d48-b6ba-49ed-8ab9-52b0c1a0179b&sid=1390030341854), using its query function to select the series listed above.

The data for Germany and the USA are stored in two separate Excel files in the `data` folder of the repository, named `Germany.xlsx` and `USA.xlsx` respectively.



---
# 2. Cleaning the data

#### Importing and merging the 2 datasets

In [15]:
import pandas as pd
import warnings

# Suppress FutureWarning (if you prefer to not see them)
warnings.simplefilter("ignore", FutureWarning)

# Define the desired final column names (the time column plus 4 data columns)
final_columns = [
    "Time (Year/Month)",
    "Economic Activity, Industrial Production, Index",
    "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate",
    "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar",
    "Prices, Consumer Price Index, All items, Index"
]

# --- Process Germany File ---
# Assume Germany.xlsx has the header row in Excel row 3 (so skip the first 2 rows) 
# and the first 5 columns are the ones we need.
germany_df = pd.read_excel("../data/Germany.xlsx", header=0, skiprows=2, usecols=[0,1,2,3,4])
germany_df.columns = final_columns
germany_df["Country"] = "Germany"

# --- Process USA File ---
# Assume USA.xlsx also uses row 3 as header (skip first 2 rows),
# but the file only has 3 columns: Time, International Reserves, and Prices.
usa_df = pd.read_excel("../data/USA.xlsx", header=0, skiprows=2)
# Select the first 3 columns (if there are extra columns, adjust this accordingly)
usa_df = usa_df.iloc[:, :3]

# Rename the existing columns according to what they actually represent in the USA file.
usa_df.columns = [
    "Time (Year/Month)",
    "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar",
    "Prices, Consumer Price Index, All items, Index"
]

# Insert two blank columns for the series the USA file does not contain.
# NaN (a float) is used rather than pd.NA so the columns stay numeric after concatenation.
# Insert "Economic Activity, Industrial Production, Index" at position 1
usa_df.insert(loc=1, column="Economic Activity, Industrial Production, Index", value=float("nan"))
# And insert "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate" at position 2
usa_df.insert(loc=2, column="Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate", value=float("nan"))

# Now the USA DataFrame has the same 5 columns as defined in final_columns.
usa_df["Country"] = "USA"

# --- Merge the Two DataFrames ---
merged_df = pd.concat([germany_df, usa_df], ignore_index=True)

merged_df 


Unnamed: 0,Time (Year/Month),"Economic Activity, Industrial Production, Index","Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate","International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar","Prices, Consumer Price Index, All items, Index",Country
0,Dec 1959,32.500305,4.2,4811.474341,24.616929,Germany
1,Jan 1960,31.193881,4.2,4724.155785,24.616929,Germany
2,Feb 1960,31.041599,4.2,4806.362830,24.477068,Germany
3,Mar 1960,32.203755,4.2,4966.456016,24.477068,Germany
4,Apr 1960,34.287622,4.2,5236.120624,24.616929,Germany
...,...,...,...,...,...,...
741,Aug 1990,,,78908.838357,60.351608,USA
742,Sep 1990,,,80024.166133,60.856066,USA
743,Oct 1990,,,82852.196532,61.222946,USA
744,Nov 1990,,,83059.402774,61.360525,USA
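
The column alignment above can also be sketched with `DataFrame.reindex`, which adds any missing columns as NaN and orders them in one step. This is only an illustrative alternative, not the method used in the notebook; the short column names and all values below are made up.

```python
import pandas as pd

# Hypothetical miniature of the two files; names and values are invented for illustration.
final_columns = ["Time", "Industrial Production", "Exchange Rate", "Reserves", "CPI"]

germany = pd.DataFrame(
    [["Dec 1959", 32.5, 4.2, 4811.0, 24.6]], columns=final_columns
)
usa = pd.DataFrame(
    [["Dec 1959", 19000.0, 13.5]], columns=["Time", "Reserves", "CPI"]
)

# reindex adds the missing columns (filled with NaN) and puts all columns in the target order.
usa_aligned = usa.reindex(columns=final_columns)

germany["Country"] = "Germany"
usa_aligned["Country"] = "USA"

# Stack the two frames on top of each other with a fresh integer index.
combined = pd.concat([germany, usa_aligned], ignore_index=True)
print(combined)
```

The result has the same shape as `merged_df`: one block of rows per country, with the series that are unavailable for the USA left empty.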


---

## Construction of variables for Germany:

#### 1. The monthly growth in the nominal exchange rate for Germany: 

I computed this variable with pandas' built-in `pct_change` method and multiplied the result by 100 to express it in percent.
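
As a quick check of what `pct_change` computes, the sketch below compares it with the explicit formula (x_t - x_{t-1}) / x_{t-1} on a small series. The values are made up and only serve as an illustration.

```python
import pandas as pd

# Made-up exchange-rate values, for illustration only.
rate = pd.Series([4.0, 4.0, 3.8, 3.99])

# Built-in month-on-month percentage change.
growth_builtin = rate.pct_change() * 100

# The same quantity written out explicitly.
growth_manual = (rate - rate.shift(1)) / rate.shift(1) * 100

print(pd.DataFrame({"builtin": growth_builtin, "manual": growth_manual}))
```

The first entry is missing in both columns because there is no earlier observation to compare against.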

In [22]:
import pandas as pd

# Assume merged_df is already created from previous steps

# Filter for Germany and work on a copy
germany_data = merged_df[merged_df["Country"] == "Germany"].copy()

# Convert the "Time (Year/Month)" column to datetime (assuming format like "Dec 1959")
germany_data["Time (Year/Month)"] = pd.to_datetime(germany_data["Time (Year/Month)"], format='%b %Y')

# Sort the data chronologically
germany_data.sort_values("Time (Year/Month)", inplace=True)

# Calculate monthly percentage change for the nominal exchange rate and multiply by 100 to express it as a percentage
germany_data["german_monthly_nominal_exchange_rate_growth"] = (
    germany_data["Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate"].pct_change() * 100
)

# Create a new DataFrame with only the Time and computed growth columns
german_monthly_change_in_nominal_exchange_rate = germany_data[[
    "Time (Year/Month)", 
    "german_monthly_nominal_exchange_rate_growth"
]].copy()

# (Optional) Reset the index for cleanliness
german_monthly_change_in_nominal_exchange_rate.reset_index(drop=True, inplace=True)

# Now, german_monthly_change_in_nominal_exchange_rate contains only the desired two columns
german_monthly_change_in_nominal_exchange_rate


Unnamed: 0,Time (Year/Month),german_monthly_nominal_exchange_rate_growth
0,1959-12-01,
1,1960-01-01,0.000000
2,1960-02-01,0.000000
3,1960-03-01,0.000000
4,1960-04-01,0.000000
...,...,...
368,1990-08-01,-4.219769
369,1990-09-01,-0.063666
370,1990-10-01,-2.955979
371,1990-11-01,-2.382984


#### 2. The monthly growth in the real exchange rate

In [None]:
import pandas as pd
import numpy as np

# ---------------------------
# 1) Filter and prepare Germany data
# ---------------------------
# Assume 'merged_df' is the DataFrame that contains:
#   - "Country" column
#   - "Time (Year/Month)" (e.g. "Dec 1959")
#   - "Prices, Consumer Price Index, All items, Index" (Germany's CPI)
#   - "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate" (Germany's nominal exchange rate)
# We'll extract and rename columns for clarity.

germany_data = merged_df.loc[merged_df["Country"] == "Germany", [
    "Time (Year/Month)",
    "Prices, Consumer Price Index, All items, Index",
    "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate"
]].copy()

# Convert the time column to datetime (format like "Dec 1959")
germany_data["Time (Year/Month)"] = pd.to_datetime(
    germany_data["Time (Year/Month)"], format='%b %Y'
)

# Rename columns for convenience
germany_data.rename(columns={
    "Prices, Consumer Price Index, All items, Index": "CPI_GER",
    "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate": "EXCH_GER"
}, inplace=True)

# Sort by time
germany_data.sort_values("Time (Year/Month)", inplace=True)

# Calculate monthly log changes:
#   inflation_ger = ln(CPI_GER_t) - ln(CPI_GER_{t-1})
germany_data["inflation_ger"] = np.log(germany_data["CPI_GER"]).diff()

#   nominal_exch_rate_growth = ln(EXCH_GER_t) - ln(EXCH_GER_{t-1})
germany_data["nominal_exch_rate_growth"] = np.log(germany_data["EXCH_GER"]).diff()

# ---------------------------
# 2) Filter and prepare US data
# ---------------------------
# We assume the US rows have a column "Prices, Consumer Price Index, All items, Index" for the US CPI.
us_data = merged_df.loc[merged_df["Country"] == "USA", [
    "Time (Year/Month)",
    "Prices, Consumer Price Index, All items, Index"
]].copy()

us_data["Time (Year/Month)"] = pd.to_datetime(
    us_data["Time (Year/Month)"], format='%b %Y'
)

us_data.rename(columns={
    "Prices, Consumer Price Index, All items, Index": "CPI_US"
}, inplace=True)

us_data.sort_values("Time (Year/Month)", inplace=True)

# Calculate monthly log changes for US inflation
us_data["inflation_us"] = np.log(us_data["CPI_US"]).diff()

# ---------------------------
# 3) Merge Germany with US data by month
# ---------------------------
merged_ger_us = pd.merge(
    germany_data,
    us_data[["Time (Year/Month)", "inflation_us"]],  # only need US inflation
    on="Time (Year/Month)",
    how="inner"  # or 'left' if you want all Germany months even if US is missing
)

# ---------------------------
# 4) Compute monthly real exchange rate growth
# ---------------------------
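# Log-change formula applied here:
#   real_exch_rate_growth = nominal_exch_rate_growth + inflation_ger - inflation_us
# Note on the sign convention: if the real exchange rate is defined as
#   RER_t = EXCH_GER_t * (CPI_US_t / CPI_GER_t)   (as for the index in part 3),
# its log change is nominal_exch_rate_growth + inflation_us - inflation_ger,
# i.e. the inflation differential enters with the opposite sign.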
merged_ger_us["real_exch_rate_growth"] = (
    merged_ger_us["nominal_exch_rate_growth"] 
    + merged_ger_us["inflation_ger"] 
    - merged_ger_us["inflation_us"]
)

# Multiply by 100 to get percentage terms:
merged_ger_us["real_exch_rate_growth"] *= 100

# ---------------------------
# 5) Create a final DataFrame with only the time and real exchange rate growth
# ---------------------------
german_change_in_real_exchange_rate = merged_ger_us[[
    "Time (Year/Month)", 
    "real_exch_rate_growth"
]].copy()

# ---------------------------
# 6) Reset the index for neatness
# ---------------------------
german_change_in_real_exchange_rate.reset_index(drop=True, inplace=True)


german_change_in_real_exchange_rate


Unnamed: 0,Time (Year/Month),real_exch_rate_growth
0,1959-12-01,
1,1960-01-01,0.340716
2,1960-02-01,-0.910483
3,1960-03-01,0.000000
4,1960-04-01,0.230208
...,...,...
368,1990-08-01,-4.916377
369,1990-09-01,-0.585995
370,1990-10-01,-2.881808
371,1990-11-01,-2.841436
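
To make the link between levels and growth rates explicit: under the level definition RER_t = EXCH_GER_t * (CPI_US_t / CPI_GER_t), which is the one used for the index in the next subsection, the log change of the real exchange rate equals the nominal log change plus US inflation minus German inflation. Note that the inflation differential enters with the opposite sign to the formula in the cell above. The sketch below checks the identity on made-up numbers; it is an illustration, not a recomputation of the notebook's results.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "EXCH_GER": [4.20, 4.20, 4.00],
    "CPI_GER": [24.6, 24.5, 24.7],
    "CPI_US": [13.5, 13.6, 13.6],
})

# Level definition: price of US goods expressed in German goods.
df["rer"] = df["EXCH_GER"] * df["CPI_US"] / df["CPI_GER"]

# Log changes of each component.
d_exch = np.log(df["EXCH_GER"]).diff()
infl_ger = np.log(df["CPI_GER"]).diff()
infl_us = np.log(df["CPI_US"]).diff()

# Growth computed directly from the level, and from its components.
growth_from_level = np.log(df["rer"]).diff() * 100
growth_from_parts = (d_exch + infl_us - infl_ger) * 100

print(pd.DataFrame({"from_level": growth_from_level, "from_parts": growth_from_parts}))
```

The two columns agree up to floating-point error, since the logarithm turns the product and ratio in the level definition into sums and differences.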


#### 3. An index of the real exchange rate (setting the real exchange rate for December 1990 = 1)


In [27]:
import pandas as pd
import numpy as np

# ---------------------------
# 1) Filter and prepare Germany data
# ---------------------------
germany_data = merged_df.loc[
    merged_df["Country"] == "Germany",
    ["Time (Year/Month)",
     "Prices, Consumer Price Index, All items, Index",
     "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate"]
].copy()

# Convert the date strings (like "Dec 1990") to actual datetime objects
germany_data["Time (Year/Month)"] = pd.to_datetime(
    germany_data["Time (Year/Month)"], format='%b %Y'
)

# Rename columns for clarity
germany_data.rename(columns={
    "Prices, Consumer Price Index, All items, Index": "CPI_GER",
    "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate": "EXCH_GER"
}, inplace=True)

# Sort chronologically
germany_data.sort_values("Time (Year/Month)", inplace=True)

# ---------------------------
# 2) Filter and prepare US data
# ---------------------------
us_data = merged_df.loc[
    merged_df["Country"] == "USA",
    ["Time (Year/Month)",
     "Prices, Consumer Price Index, All items, Index"]
].copy()

us_data["Time (Year/Month)"] = pd.to_datetime(
    us_data["Time (Year/Month)"], format='%b %Y'
)
us_data.rename(columns={
    "Prices, Consumer Price Index, All items, Index": "CPI_US"
}, inplace=True)

us_data.sort_values("Time (Year/Month)", inplace=True)

# ---------------------------
# 3) Merge the two DataFrames by month
# ---------------------------
ger_us_merged = pd.merge(
    germany_data,
    us_data[["Time (Year/Month)", "CPI_US"]],
    on="Time (Year/Month)",
    how="inner"
)

# ---------------------------
# 4) Compute the Real Exchange Rate
# ---------------------------
# RER_t = EXCH_GER_t * (CPI_US_t / CPI_GER_t)
ger_us_merged["real_exchange_rate"] = (
    ger_us_merged["EXCH_GER"] * (ger_us_merged["CPI_US"] / ger_us_merged["CPI_GER"])
)

# ---------------------------
# 5) Normalize so that December 1990 = 1
# ---------------------------
base_date = pd.to_datetime("Dec 1990", format='%b %Y')
base_row = ger_us_merged.loc[ger_us_merged["Time (Year/Month)"] == base_date, "real_exchange_rate"]

if len(base_row) == 0:
    raise ValueError("No data found for December 1990 in Germany's real exchange rate data.")

base_value = base_row.values[0]

# real_exchange_rate_index = real_exchange_rate / base_value
ger_us_merged["real_exchange_rate_index"] = ger_us_merged["real_exchange_rate"] / base_value

# ---------------------------
# 6) Create a final DataFrame with only time + the index
# ---------------------------
german_real_exchange_rate_index = ger_us_merged[[
    "Time (Year/Month)",
    "real_exchange_rate_index"
]].copy()

german_real_exchange_rate_index.reset_index(drop=True, inplace=True)


german_real_exchange_rate_index


Unnamed: 0,Time (Year/Month),real_exchange_rate_index
0,1959-12-01,1.713301
1,1960-01-01,1.707474
2,1960-02-01,1.723091
3,1960-03-01,1.723091
4,1960-04-01,1.719129
...,...,...
368,1990-08-01,1.045085
369,1990-09-01,1.049889
370,1990-10-01,1.017646
371,1990-11-01,0.997672
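
The normalisation step (dividing every observation by the value in the base month) can be sketched as a small reusable function. The helper name `rebase`, its parameters, and the demo values are hypothetical and not part of the notebook.

```python
import pandas as pd

def rebase(df, value_col, date_col, base_date, base_level=1.0):
    """Scale value_col so that its value at base_date equals base_level."""
    base = df.loc[df[date_col] == pd.Timestamp(base_date), value_col]
    if base.empty:
        raise ValueError(f"No observation for base date {base_date}")
    return df[value_col] / base.iloc[0] * base_level

# Made-up data for illustration only.
demo = pd.DataFrame({
    "date": pd.to_datetime(["1990-10-01", "1990-11-01", "1990-12-01"]),
    "rer": [2.4, 2.2, 2.0],
})

# Index with December 1990 = 1.
demo["rer_index"] = rebase(demo, "rer", "date", "1990-12-01")

# The same helper with a different base month and base level (= 100).
demo["rer_index_100"] = rebase(demo, "rer", "date", "1990-10-01", base_level=100)

print(demo)
```

Raising an error when the base month is missing mirrors the check in the cell above and avoids silently producing an index of NaNs.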


#### 4. German monthly inflation rate

In [28]:
import pandas as pd

# --- Filter for Germany's CPI Data ---
germany_data = merged_df.loc[
    merged_df["Country"] == "Germany",
    ["Time (Year/Month)", "Prices, Consumer Price Index, All items, Index"]
].copy()

# --- Convert Time Column to Datetime ---
# Here, we assume the time strings are like "Dec 1990"
germany_data["Time (Year/Month)"] = pd.to_datetime(
    germany_data["Time (Year/Month)"], format='%b %Y'
)

# --- Sort the DataFrame Chronologically ---
germany_data.sort_values("Time (Year/Month)", inplace=True)

# --- Compute the Monthly Inflation Rate (Arithmetic Change) ---
# The inflation rate is computed as: ((CPI_t - CPI_{t-1}) / CPI_{t-1}) * 100
germany_data["monthly_inflation_rate"] = (
    (germany_data["Prices, Consumer Price Index, All items, Index"].diff() /
     germany_data["Prices, Consumer Price Index, All items, Index"].shift(1)) * 100
)

# --- Create a Final DataFrame with Only Time and Inflation Rate ---
german_monthly_inflation = germany_data[[
    "Time (Year/Month)", 
    "monthly_inflation_rate"
]].copy()

# Reset index for neatness
german_monthly_inflation.reset_index(drop=True, inplace=True)

german_monthly_inflation


Unnamed: 0,Time (Year/Month),monthly_inflation_rate
0,1959-12-01,
1,1960-01-01,0.000000
2,1960-02-01,-0.568147
3,1960-03-01,0.000000
4,1960-04-01,0.571393
...,...,...
368,1990-08-01,0.311532
369,1990-09-01,0.310565
370,1990-10-01,0.722391
371,1990-11-01,-0.204922
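
This section uses the arithmetic change, while the real exchange rate growth above uses log changes. For small monthly movements the two are close but not identical. The sketch below compares them for the German CPI values of January and February 1960 shown in the merged table; the variable names are illustrative.

```python
import math

# German CPI for Jan 1960 and Feb 1960, as displayed in the merged table.
cpi_prev, cpi_curr = 24.616929, 24.477068

# Arithmetic change: (CPI_t - CPI_{t-1}) / CPI_{t-1}, in percent.
arithmetic = (cpi_curr - cpi_prev) / cpi_prev * 100

# Log change: ln(CPI_t) - ln(CPI_{t-1}), in percent.
log_change = (math.log(cpi_curr) - math.log(cpi_prev)) * 100

print(f"arithmetic: {arithmetic:.6f}  log: {log_change:.6f}")
```

The log change is never larger than the arithmetic change, and the gap widens as the movement gets bigger, so the two measures should not be mixed within one calculation.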


#### 5. The monthly growth in industrial production

In [30]:
import pandas as pd

# --- Filter for Germany's Industrial Production Data ---
# We assume that the merged_df DataFrame contains a column for the industrial production index
# labeled "Economic Activity, Industrial Production, Index".
germany_industrial = merged_df.loc[
    merged_df["Country"] == "Germany",
    ["Time (Year/Month)", "Economic Activity, Industrial Production, Index"]
].copy()

# --- Convert the "Time (Year/Month)" Column to Datetime ---
# Here we assume the time is formatted like "Dec 1990"
germany_industrial["Time (Year/Month)"] = pd.to_datetime(
    germany_industrial["Time (Year/Month)"], format='%b %Y'
)

# --- Sort the DataFrame Chronologically ---
germany_industrial.sort_values("Time (Year/Month)", inplace=True)

# --- Calculate the Monthly Growth in Industrial Production ---
# Compute the percentage change from one month to the next using arithmetic change.
# Formula: ((IP_t - IP_{t-1}) / IP_{t-1}) * 100
germany_industrial["german_monthly_industrial_production_growth"] = (
    (germany_industrial["Economic Activity, Industrial Production, Index"].diff() /
     germany_industrial["Economic Activity, Industrial Production, Index"].shift(1)) * 100
)

# --- Create a Final DataFrame with Only the Time and the Growth Rate ---
german_monthly_industrial_production_growth = germany_industrial[[
    "Time (Year/Month)",
    "german_monthly_industrial_production_growth"
]].copy()

german_monthly_industrial_production_growth.reset_index(drop=True, inplace=True)

# Display the resulting DataFrame
german_monthly_industrial_production_growth


Unnamed: 0,Time (Year/Month),german_monthly_industrial_production_growth
0,1959-12-01,
1,1960-01-01,-4.019729
2,1960-02-01,-0.488181
3,1960-03-01,3.743868
4,1960-04-01,6.470881
...,...,...
368,1990-08-01,-3.590683
369,1990-09-01,13.863684
370,1990-10-01,7.107116
371,1990-11-01,-3.434553


#### 6. The annual growth in industrial production

In [32]:
import pandas as pd

# --- Step 1: Filter and prepare Germany's Industrial Production Data ---
# We assume 'merged_df' (from previous steps) contains the column:
# "Economic Activity, Industrial Production, Index"
germany_industrial = merged_df.loc[
    merged_df["Country"] == "Germany",
    ["Time (Year/Month)", "Economic Activity, Industrial Production, Index"]
].copy()

# Convert the "Time (Year/Month)" column from strings (e.g. "Dec 1990") to datetime.
germany_industrial["Time (Year/Month)"] = pd.to_datetime(
    germany_industrial["Time (Year/Month)"], format='%b %Y'
)

# Sort the data by time to ensure the calculations are in order.
germany_industrial.sort_values("Time (Year/Month)", inplace=True)

# --- Step 2: Compute the Annual Industrial Production Growth ---
# For each month, the annual growth is calculated as:
#    ((IP_t - IP_{t-12}) / IP_{t-12}) * 100
# For the first 12 months, this will result in missing values.
germany_industrial["annual_growth"] = (
    (germany_industrial["Economic Activity, Industrial Production, Index"] -
     germany_industrial["Economic Activity, Industrial Production, Index"].shift(12))
    / germany_industrial["Economic Activity, Industrial Production, Index"].shift(12)
) * 100

# --- Step 3: Fill Gaps via Interpolation ---
# Interior gaps (e.g. from missing data) are filled by linear interpolation.
# The first 12 values have no earlier neighbour to interpolate from, so with
# limit_direction="both" they are filled with the first available annual growth value.
germany_industrial["annual_growth_filled"] = germany_industrial["annual_growth"].interpolate(method='linear', limit_direction='both')

# --- Step 4: Create the Final DataFrame ---
# We extract only the "Time (Year/Month)" and the gap-filled annual growth rate,
# and rename the growth column to "german_annual_industrial_production_growth".
german_annual_industrial_production_growth = germany_industrial[[
    "Time (Year/Month)", "annual_growth_filled"
]].copy()

german_annual_industrial_production_growth.rename(
    columns={"annual_growth_filled": "german_annual_industrial_production_growth"},
    inplace=True
)

# Reset the index for neatness.
german_annual_industrial_production_growth.reset_index(drop=True, inplace=True)

german_annual_industrial_production_growth

Unnamed: 0,Time (Year/Month),german_annual_industrial_production_growth
0,1959-12-01,10.900123
1,1960-01-01,10.900123
2,1960-02-01,10.900123
3,1960-03-01,10.900123
4,1960-04-01,10.900123
...,...,...
368,1990-08-01,6.017897
369,1990-09-01,5.514275
370,1990-10-01,6.152998
371,1990-11-01,5.580779
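
The sketch below illustrates two points about this step on a made-up series: the 12-month growth formula is equivalent to `pct_change(periods=12)`, and `interpolate(..., limit_direction="both")` fills the leading missing values with the first available value rather than extrapolating a trend. The series and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up index rising by 1 point per month: 15 monthly observations.
ip = pd.Series(np.arange(100.0, 115.0))

# Year-on-year growth written out explicitly, and with the built-in method.
annual_manual = (ip - ip.shift(12)) / ip.shift(12) * 100
annual_builtin = ip.pct_change(periods=12) * 100

# Fill missing values; the 12 leading NaNs take the first valid value.
filled = annual_builtin.interpolate(method="linear", limit_direction="both")

print(pd.DataFrame({"annual": annual_builtin, "filled": filled}))
```

This explains why the first rows of the output above all show the same number: they repeat the first month for which a 12-month comparison exists.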


#### 7. An index of the value of international reserves (value of reserves for January 1960 = 100)

In [34]:
import pandas as pd

# --- Step 1: Filter for Germany's International Reserves Data ---
# We assume merged_df has the column "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"
germany_reserves = merged_df.loc[
    merged_df["Country"] == "Germany",
    ["Time (Year/Month)", "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"]
].copy()

# --- Step 2: Convert the Time Column to Datetime ---
# Assuming time strings like "Jan 1960", we use format '%b %Y'
germany_reserves["Time (Year/Month)"] = pd.to_datetime(
    germany_reserves["Time (Year/Month)"], format='%b %Y'
)

# --- Step 3: Sort Data Chronologically ---
germany_reserves.sort_values("Time (Year/Month)", inplace=True)

# --- Step 4: Compute the International Reserves Index ---
# We set the base period as January 1960, where the index is defined to be 100.
base_date = pd.to_datetime("Jan 1960", format='%b %Y')
base_value_series = germany_reserves.loc[
    germany_reserves["Time (Year/Month)"] == base_date,
    "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"
]

if base_value_series.empty:
    raise ValueError("No data found for January 1960 in the international reserves column.")

base_value = base_value_series.iloc[0]

# Create the index: for each month, index = (current value / base value) * 100.
germany_reserves["international_reserves_index"] = (
    germany_reserves["International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"] / base_value
) * 100

# --- Step 5: Create a Final DataFrame with Only the Time and the Computed Index ---
german_international_reserves_index = germany_reserves[[
    "Time (Year/Month)",
    "international_reserves_index"
]].copy()

german_international_reserves_index.reset_index(drop=True, inplace=True)

german_international_reserves_index


Unnamed: 0,Time (Year/Month),international_reserves_index
0,1959-12-01,101.848342
1,1960-01-01,100.000000
2,1960-02-01,101.740143
3,1960-03-01,105.128964
4,1960-04-01,110.837171
...,...,...
368,1990-08-01,1533.093782
369,1990-09-01,1549.431834
370,1990-10-01,1587.837706
371,1990-11-01,1612.268481


---


## Variables for the USA

#### 1. The US monthly inflation rate

In [36]:
import pandas as pd

# --- Filter for US CPI Data ---
# Extract only the columns for time and US CPI from the merged DataFrame.
us_data = merged_df.loc[
    merged_df["Country"] == "USA",
    ["Time (Year/Month)", "Prices, Consumer Price Index, All items, Index"]
].copy()

# --- Convert the "Time (Year/Month)" Column to Datetime ---
# Here, we assume the time strings are formatted like "Dec 1959".
us_data["Time (Year/Month)"] = pd.to_datetime(us_data["Time (Year/Month)"], format='%b %Y')

# --- Sort the DataFrame Chronologically ---
us_data.sort_values("Time (Year/Month)", inplace=True)

# --- Calculate the Monthly Inflation Rate (Arithmetic Change) ---
# The formula used is:
#    Inflation_t = ((CPI_t - CPI_{t-1}) / CPI_{t-1}) * 100
us_data["monthly_inflation_rate"] = (
    (us_data["Prices, Consumer Price Index, All items, Index"].diff() / 
     us_data["Prices, Consumer Price Index, All items, Index"].shift(1)) * 100
)

# --- Create a Final DataFrame with Only the Time and the Computed Inflation Rate ---
us_monthly_inflation = us_data[[
    "Time (Year/Month)",
    "monthly_inflation_rate"
]].copy()

# Optionally, reset the index for neatness
us_monthly_inflation.reset_index(drop=True, inplace=True)

us_monthly_inflation


Unnamed: 0,Time (Year/Month),monthly_inflation_rate
0,1959-12-01,
1,1960-01-01,-0.340136
2,1960-02-01,0.341297
3,1960-03-01,0.000000
4,1960-04-01,0.340136
...,...,...
368,1990-08-01,0.920245
369,1990-09-01,0.835866
370,1990-10-01,0.602864
371,1990-11-01,0.224719


#### 2. An index of the value of international reserves (value of reserves for January 1960 = 100)

In [37]:
import pandas as pd

# --- Step 1: Filter for US International Reserves Data ---
# We assume merged_df has the column "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"
us_reserves = merged_df.loc[
    merged_df["Country"] == "USA",
    ["Time (Year/Month)", "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"]
].copy()

# --- Step 2: Convert the Time Column to Datetime ---
# Assuming time strings like "Jan 1960", we use format '%b %Y'
us_reserves["Time (Year/Month)"] = pd.to_datetime(us_reserves["Time (Year/Month)"], format='%b %Y')

# --- Step 3: Sort the DataFrame Chronologically ---
us_reserves.sort_values("Time (Year/Month)", inplace=True)

# --- Step 4: Find the Base Value for January 1960 ---
base_date = pd.to_datetime("Jan 1960", format='%b %Y')
base_value_series = us_reserves.loc[
    us_reserves["Time (Year/Month)"] == base_date, 
    "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"
]

if base_value_series.empty:
    raise ValueError("No data found for January 1960 in the US international reserves column.")

base_value = base_value_series.iloc[0]

# --- Step 5: Compute the US International Reserves Index ---
# For each month, index = (current reserves / base value) * 100.
us_reserves["us_international_reserves_index"] = (
    us_reserves["International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar"] / base_value
) * 100

# --- Step 6: Create a Final DataFrame with Only Time and the Computed Index ---
us_international_reserves_index = us_reserves[[
    "Time (Year/Month)", "us_international_reserves_index"
]].copy()

us_international_reserves_index.reset_index(drop=True, inplace=True)

us_international_reserves_index


Unnamed: 0,Time (Year/Month),us_international_reserves_index
0,1959-12-01,100.122916
1,1960-01-01,100.000000
2,1960-02-01,99.616353
3,1960-03-01,99.378902
4,1960-04-01,99.068353
...,...,...
368,1990-08-01,367.392080
369,1990-09-01,372.584941
370,1990-10-01,385.751982
371,1990-11-01,386.716715
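
Every cell above repeats the same three preparation steps: filter one country, parse the time column, and sort chronologically. One way to factor that out is sketched below. The helper `prepare_country` and the demo frame are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

def prepare_country(df, country, columns, time_col="Time (Year/Month)"):
    """Return the rows for one country with a parsed, chronologically sorted time column."""
    out = df.loc[df["Country"] == country, [time_col, *columns]].copy()
    # Time strings are assumed to look like "Dec 1959".
    out[time_col] = pd.to_datetime(out[time_col], format="%b %Y")
    return out.sort_values(time_col).reset_index(drop=True)

# Made-up data for illustration only (deliberately out of order).
demo = pd.DataFrame({
    "Time (Year/Month)": ["Feb 1960", "Jan 1960", "Jan 1960"],
    "CPI": [24.5, 24.6, 13.5],
    "Country": ["Germany", "Germany", "USA"],
})

ger = prepare_country(demo, "Germany", ["CPI"])
print(ger)
```

Working on a copy keeps `merged_df` itself unchanged, which matches how the cells above treat it.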


---
# Identification of outliers

I identify outliers with the 1.5 × IQR rule: a value is flagged if it is greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1.
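
A small worked example of the rule, on made-up numbers chosen so the quartiles are easy to verify by hand:

```python
import pandas as pd

# Made-up data: eight ordinary values and one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = values.quantile(0.25)   # lower quartile
q3 = values.quantile(0.75)   # upper quartile
iqr = q3 - q1

# Anything outside [lower, upper] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower}, {upper})")
print(outliers)
```

Here Q1 = 3 and Q3 = 7, so the bounds are -3 and 13 and only the value 100 is flagged.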

In [40]:
import pandas as pd

def detect_outliers(df, col):
    """
    Given a DataFrame and the name of a numeric column,
    computes Q1, Q3, IQR, lower bound, and upper bound
    based on the 1.5×IQR rule, and returns these along with
    a DataFrame of rows where the value in col is an outlier.
    """
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    return Q1, Q3, IQR, lower_bound, upper_bound, outliers

# Define a list of tests as tuples:
# (DataFrame, expected column name, descriptive name for printing)
tests = [
    (german_monthly_change_in_nominal_exchange_rate, "german_monthly_nominal_exchange_rate_growth", "German Nominal Exchange Rate Growth"),
    (german_change_in_real_exchange_rate, "real_exch_rate_growth", "German Real Exchange Rate Growth"),
    (german_international_reserves_index, "international_reserves_index", "German International Reserves Index"),
    (german_monthly_inflation, "monthly_inflation_rate", "German Monthly Inflation Rate"),
    (german_monthly_industrial_production_growth, "german_monthly_industrial_production_growth", "German Monthly Industrial Production Growth"),
    (german_annual_industrial_production_growth, "german_annual_industrial_production_growth", "German Annual Industrial Production Growth"),
    (us_monthly_inflation, "monthly_inflation_rate", "US Monthly Inflation Rate"),
    (us_international_reserves_index, "us_international_reserves_index", "US International Reserves Index")
]

# Loop over each test and print outlier information.
for df, col, desc in tests:
    print(f"Testing outliers for {desc}:")
    if col not in df.columns:
        print(f"  Column '{col}' not found. Available columns: {df.columns.tolist()}\n")
        continue
    Q1, Q3, IQR, lb, ub, out_df = detect_outliers(df, col)
    print(f"  Q1: {Q1}")
    print(f"  Q3: {Q3}")
    print(f"  IQR: {IQR}")
    print(f"  Lower Bound: {lb}")
    print(f"  Upper Bound: {ub}")
    if out_df.empty:
        print("  No outliers detected.\n")
    else:
        print("  Outliers detected:")
        print(out_df)
        print()
    print("="*60)


Testing outliers for German Nominal Exchange Rate Growth:
  Q1: -1.0893842553304807
  Q3: 0.36522746248554006
  IQR: 1.4546117178160207
  Lower Bound: -3.2713018320545117
  Upper Bound: 2.547145039209571
  Outliers detected:
    Time (Year/Month)  german_monthly_nominal_exchange_rate_growth
15         1961-03-01                                    -4.761905
119        1969-11-01                                    -8.500000
144        1971-12-01                                     4.152072
145        1972-01-01                                    -6.857094
158        1973-02-01                                    -5.854940
..                ...                                          ...
353        1989-05-01                                     4.271814
355        1989-07-01                                    -4.381847
358        1989-10-01                                    -4.409505
360        1989-12-01                                    -4.867523
368        1990-08-01                 