# Data Cleaning
---


**Objective:** obtain the following data for all countries, covering December 1959 to December 1990:


1. Industrial production (Index)

2. Exchange rates, National Currency per US dollar (Period Average)

3. Consumer prices (All items), index

4. International Reserves and Liquidity (Reserves, Official Reserve Assets, US Dollar)

5. Data for consumer prices and international reserves for the United States only over the same time period.
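The target window can be written down explicitly, which makes it easy to check later whether a series has gaps. The sketch below is only an illustration: it assumes monthly observations labelled in the `Dec 1959` style (English month abbreviations), and `missing_months` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Every month from December 1959 to December 1990, inclusive
expected = pd.period_range(start="1959-12", end="1990-12", freq="M")

# Labels in the assumed "Dec 1959" style (depends on an English locale for %b)
expected_labels = list(expected.strftime("%b %Y"))

def missing_months(observed_labels):
    """Return the expected month labels that do not appear in observed_labels."""
    seen = set(observed_labels)
    return [label for label in expected_labels if label not in seen]

print(len(expected_labels))                      # number of months in the window
print(missing_months(expected_labels[:-1]))      # a series that stops one month early
```

Applied to the time column of one country, an empty list would mean that the full window is covered.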


---

# 1. Downloading the data

We collected the data from the [IMF data portal](https://data.imf.org/?sk=4c514d48-b6ba-49ed-8ab9-52b0c1a0179b&sid=1390030341854), using its query function to select the desired series.

The data for Germany and the USA are stored in two separate Excel files, `Germany.xlsx` and `USA.xlsx`, in the `data` folder of the repository.



---
# 2. Cleaning the data

#### Importing and merging the 2 datasets
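The cell below skips the two title rows of each export, takes the next row as the header, reindexes to a common column set, and stacks the countries. As a minimal sketch of those pandas behaviours, the snippet here uses an in-memory stand-in: `read_csv` replaces `read_excel` so that there is no file dependency, and the sheet contents, short column names, and country labels are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical export: two title rows, then the header row, then the data
raw = (
    "Hypothetical export title\n"
    "Hypothetical subtitle\n"
    "Time,Reserves,CPI\n"
    "Dec 1959,1.0,10.0\n"
    "Jan 1960,2.0,11.0\n"
)

# skiprows=2 drops the title rows; header=0 uses the next row as column names
sheet = pd.read_csv(io.StringIO(raw), skiprows=2, header=0)

# reindex matches columns by name; names absent from the sheet are filled with NaN
target = ["Time", "Production", "Reserves", "CPI"]
aligned = sheet.reindex(columns=target)

# Stack two labelled copies row-wise with a fresh 0..n-1 index
stacked = pd.concat(
    [aligned.assign(Country="A"), aligned.assign(Country="B")],
    ignore_index=True,
)
print(stacked)
```

Because `reindex` works on labels, each series ends up in the right place only if the columns carry the correct names before reindexing.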

In [10]:
import pandas as pd

# Define the expected final column names for the data (for the 5 columns)
final_columns = [
    "Time (Year/Month)",
    "Economic Activity, Industrial Production, Index",
    "Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate",
    "International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar",
    "Prices, Consumer Price Index, All items, Index"
]

def process_file(file_path, country, skiprows=2, columns=None):
    """
    Reads an Excel file using the header from row 3 (Excel numbering) so that data starts on row 4.
    Then it labels the file's columns with the names given in `columns` (in file order; defaults to
    all of final_columns) and reindexes the DataFrame to ensure it has the 5 final columns
    (adding NaN for missing ones).
    Finally, it adds a 'Country' column.
    """
    # Read the file: the header row is the first row after skipping the first 2 rows.
    df = pd.read_excel(file_path, header=0, skiprows=skiprows)

    # Names for the columns actually present in the file, in file order
    names = list(final_columns if columns is None else columns)

    # Fail loudly on a mismatch instead of silently mislabelling a series
    if df.shape[1] != len(names):
        raise ValueError(
            f"{file_path}: read {df.shape[1]} columns but got {len(names)} column names"
        )
    df.columns = names

    # Reindex the DataFrame to force it to have exactly the final_columns;
    # missing columns will be filled with NaN.
    df = df.reindex(columns=final_columns)

    # Add the 'Country' column with the provided country name
    df["Country"] = country

    return df

# The USA file holds only the time column, reserves, and consumer prices, in that order
# (assumption inferred from the magnitudes of the values; check it against the file's header row).
usa_columns = [final_columns[0], final_columns[3], final_columns[4]]

# Process the USA file (adjust the file path if needed)
usa_df = process_file("../data/USA.xlsx", "USA", skiprows=2, columns=usa_columns)

# Process the Germany file (contains all 5 columns)
germany_df = process_file("../data/Germany.xlsx", "Germany", skiprows=2)

# Merge the two DataFrames (stack row-wise)
merged_df = pd.concat([usa_df, germany_df], ignore_index=True)

merged_df


| | Time (Year/Month) | Economic Activity, Industrial Production, Index | Exchange Rates, National Currency Per U.S. Dollar, Period Average, Rate | International Reserves and Liquidity, Reserves, Official Reserve Assets, US Dollar | Prices, Consumer Price Index, All items, Index | Country |
|---|---|---|---|---|---|---|
| 0 | Dec 1959 | NaN | NaN | 21504.500000 | 13.482806 | USA |
| 1 | Jan 1960 | NaN | NaN | 21478.100000 | 13.436946 | USA |
| 2 | Feb 1960 | NaN | NaN | 21395.700000 | 13.482806 | USA |
| 3 | Mar 1960 | NaN | NaN | 21344.700000 | 13.482806 | USA |
| 4 | Apr 1960 | NaN | NaN | 21278.000000 | 13.528666 | USA |
| ... | ... | ... | ... | ... | ... | ... |
| 741 | Aug 1990 | 75.964955 | 1.570700 | 72425.738573 | 67.556702 | Germany |
| 742 | Sep 1990 | 86.496497 | 1.569700 | 73197.573621 | 67.766509 | Germany |
| 743 | Oct 1990 | 92.643903 | 1.523300 | 75011.926830 | 68.256049 | Germany |
| 744 | Nov 1990 | 89.461999 | 1.487000 | 76166.074709 | 68.116177 | Germany |
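
A possible follow-up check on the merged frame is whether each country covers the target window and where values are missing. This is a minimal sketch, assuming the time labels follow the `Dec 1959` pattern seen in the output above. The `Month` column, the helper names, and the toy frame with made-up values are illustrative and not part of the original notebook.

```python
import pandas as pd

TIME_COL = "Time (Year/Month)"
CPI_COL = "Prices, Consumer Price Index, All items, Index"
START = pd.Timestamp("1959-12-01")
END = pd.Timestamp("1990-12-01")

def add_month(df, time_col=TIME_COL):
    """Return a copy with a 'Month' timestamp (first day of month) parsed from labels like 'Dec 1959'."""
    out = df.copy()
    out["Month"] = pd.to_datetime(out[time_col], format="%b %Y")
    return out

def coverage(df):
    """First month, last month, and row count per country."""
    return df.groupby("Country")["Month"].agg(["min", "max", "count"])

def missing_counts(df, value_cols):
    """Number of NaN entries per country for each value column."""
    return df[value_cols].isna().groupby(df["Country"]).sum()

# Toy stand-in for merged_df (made-up values)
toy = pd.DataFrame({
    TIME_COL: ["Dec 1959", "Jan 1960", "Dec 1959", "Jan 1960"],
    CPI_COL: [13.5, None, 24.0, 24.1],
    "Country": ["USA", "USA", "Germany", "Germany"],
})

toy = add_month(toy)
in_window = toy[toy["Month"].between(START, END)]  # both bounds inclusive

print(coverage(in_window))
print(missing_counts(in_window, [CPI_COL]))
```

On the real data, `add_month(merged_df)` would take the place of the toy frame, and the list passed to `missing_counts` would hold the four series names from `final_columns`.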
