# **Step 1: Setting Up the Environment**


**Log in to Google Colab and create a new notebook**

Required libraries: **pandas** and **scikit-learn**

*   Colab already has **pandas** and **scikit-learn** pre-installed, so no installation step is needed.
*   **pandas** is used for data manipulation.
*   **scikit-learn** is used to implement the model.

# **Step 2: Importing Libraries and Loading the Dataset**

### **2.1 Import the Libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

**Explanation:**

* pandas: To handle and analyze the dataset.
* numpy: For numerical operations.
* LinearRegression: To build our supervised learning model.
* train_test_split: To split the dataset into training and testing sets.
* mean_squared_error and r2_score: To evaluate the model’s performance.

### **2.2 Load the Dataset**

Click the Files icon on the left panel, select Upload, and choose `economic_data.csv`.


In [None]:
data = pd.read_csv('/content/economic_data.csv')  # Make sure the path matches your file's name
data.head()


|   | date | GDP_growth | unemployment_rate | inflation_rate |
|---|------|------------|-------------------|----------------|
| 0 | 2000-03-31 | 0.242723 | 7.378637 | 2.899741 |
| 1 | 2000-06-30 | 4.751796 | 4.909326 | 3.359057 |
| 2 | 2000-09-30 | -1.637014 | 4.972832 | 2.771238 |
| 3 | 2000-12-31 | 1.120808 | 7.710629 | 3.457831 |
| 4 | 2001-03-31 | 0.450201 | 5.26427 | 4.618121 |


**Explanation:**

Reads the CSV file and shows the first few rows to confirm the data has been loaded correctly.
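
The `date` column is read as plain text unless you ask otherwise. As a minimal sketch, the `parse_dates` argument of `pd.read_csv` converts it while loading. The example inlines the five rows shown above in a string so it runs without the uploaded file; the names `sample_csv` and `sample` are illustrative only.

```python
import io
import pandas as pd

# The five rows printed by data.head(), inlined so the example runs without the file
sample_csv = """date,GDP_growth,unemployment_rate,inflation_rate
2000-03-31,0.242723,7.378637,2.899741
2000-06-30,4.751796,4.909326,3.359057
2000-09-30,-1.637014,4.972832,2.771238
2000-12-31,1.120808,7.710629,3.457831
2001-03-31,0.450201,5.26427,4.618121
"""

# parse_dates converts the named column to datetime64 while reading
sample = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])
print(sample.dtypes)
print(sample.shape)
```

With the real file, the same argument could be passed alongside the `/content/economic_data.csv` path.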


# **Step 3: Data Exploration and Preprocessing**



### **3.1 Explore the Dataset**

In [None]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   date               100 non-null    object 
 1   GDP_growth         100 non-null    float64
 2   unemployment_rate  100 non-null    float64
 3   inflation_rate     100 non-null    float64
dtypes: float64(3), object(1)
memory usage: 3.2+ KB


**Explanation:**

Shows each column's data type and non-null count. Here all 100 rows are complete, and `date` is stored as `object` (text) rather than as a datetime.
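
If you prefer an explicit count of missing values per column, `isna().sum()` gives it directly. The sketch below uses a made-up three-row frame (`toy`) with one deliberately missing value, since the real dataset has none.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "unemployment_rate": [5.1, np.nan, 6.3],
    "inflation_rate": [2.0, 2.4, 3.1],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```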

In [None]:
data.describe()

|   | GDP_growth | unemployment_rate | inflation_rate |
|---|------------|-------------------|----------------|
| count | 100.0 | 100.0 | 100.0 |
| mean | 1.541926 | 6.540095 | 2.7196 |
| std | 2.065896 | 2.151308 | 1.402203 |
| min | -1.947568 | 3.132265 | 0.031214 |
| 25% | -0.174888 | 4.602727 | 1.555644 |
| 50% | 1.372377 | 6.722063 | 2.73095 |
| 75% | 3.201292 | 8.801613 | 3.823869 |
| max | 4.914343 | 9.90343 | 4.959457 |


**Explanation:**

Shows summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column.

### **3.2 Data Cleaning**



In [None]:
data = data.dropna()

**Explanation:**

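Removes every row that contains at least one missing value. The `data.info()` output above reports 100 non-null entries in every column, so for this dataset no rows are actually dropped.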
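
To see how many rows `dropna()` removes, compare the row count before and after. This is a small sketch on a hypothetical frame (`toy`) that contains gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 each contain one missing value
toy = pd.DataFrame({
    "GDP_growth": [1.2, 0.4, np.nan, 2.1],
    "unemployment_rate": [5.1, np.nan, 6.3, 4.8],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops any row with at least one NaN
rows_after = len(cleaned)
print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
```

If a large share of rows disappears, it may be worth checking which column causes the loss before continuing.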

### **3.3 Feature Selection**

Identify the columns (features) that you’ll use for prediction.

For example, we might want to predict GDP growth (GDP_growth) using the unemployment rate (unemployment_rate) and inflation rate (inflation_rate):

In [None]:
features = ['unemployment_rate', 'inflation_rate']
target = 'GDP_growth'
X = data[features]
y = data[target]


**Explanation:**

* X holds the input features (what we use to predict).
* y holds the target variable (what we want to predict).
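
A quick shape check can confirm the selection worked. In this sketch on a made-up frame (`toy`, `X_toy`, `y_toy` are illustrative names), indexing with a list of column names returns a two-dimensional DataFrame, while a single name returns a one-dimensional Series.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the dataset
toy = pd.DataFrame({
    "GDP_growth": [1.2, 0.4, 2.1],
    "unemployment_rate": [5.1, 6.3, 4.8],
    "inflation_rate": [2.0, 2.4, 3.1],
})

X_toy = toy[["unemployment_rate", "inflation_rate"]]  # list of names -> DataFrame
y_toy = toy["GDP_growth"]                              # single name -> Series
print(X_toy.shape, y_toy.shape)
```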

# **Step 4: Building the Supervised Learning Model**

### **4.1 Splitting the Dataset**

Split the data into training and testing sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Explanation:**

X_train and y_train: Data used to train the model (80% of the dataset).

X_test and y_test: Data used to test the model (20% of the dataset).
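
The sketch below checks the 80/20 split sizes and shows what `random_state` does. It uses hypothetical placeholder arrays with 100 rows (the row count reported by `data.info()`), not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with 100 rows
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

Note that `train_test_split` shuffles rows by default, so for dated data the test set is a random sample of quarters rather than the most recent ones.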

### **4.2 Training the Model**

Create an instance of the Linear Regression model and train it using the training data.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)


**Explanation:**

fit() is the method that trains the model using X_train and y_train.
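
After `fit()`, the learned parameters are available as `coef_` and `intercept_`. As an illustration only, the sketch below fits a model to synthetic, noise-free data generated from made-up coefficients (2 and -3, intercept 1) and shows that the fit recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))

# Noise-free target built from known, made-up coefficients
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)
print(demo_model.intercept_)  # close to 1.0
print(demo_model.coef_)       # close to [2.0, -3.0]
```

On real data the coefficients will not be this clean, but their signs and sizes indicate how each feature relates to the predicted GDP growth.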

### **4.3 Making Predictions**

Use the trained model to make predictions on the test data.

In [None]:
predictions = model.predict(X_test)


**Explanation:**

This code predicts GDP growth using the test set and stores the predictions in the variable predictions.
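
For a linear regression, `predict()` is simply the intercept plus the weighted sum of the features. The following sketch, on synthetic data with hypothetical input rows (`X_new`), checks that the two agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
y_demo = 0.5 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=40)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Two hypothetical new observations (unemployment rate, inflation rate)
X_new = np.array([[6.0, 2.5], [4.0, 3.0]])

by_predict = demo_model.predict(X_new)
by_hand = demo_model.intercept_ + X_new @ demo_model.coef_
print(np.allclose(by_predict, by_hand))
```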

# **Step 5: Evaluating the Model**

### **5.1 Calculate Performance Metrics**

Measure how well the model performs using R-squared and Mean Squared Error (MSE).

In [None]:
r2 = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print(f'R-squared: {r2}')
print(f'Mean Squared Error: {mse}')


**Explanation:**

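R-squared tells us what share of the variation in GDP growth the model explains. A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and on a test set the value can even be negative.

MSE is the average of the squared differences between predicted and actual values, so it is expressed in squared units of GDP growth. A lower value indicates a better fit.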
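
Both metrics can be computed by hand, which makes clear what they measure. The sketch below uses four made-up actual and predicted values and compares the manual results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)            # average squared error

ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

rmse = np.sqrt(mse_manual)                      # back in the units of the target

print(mse_manual, mean_squared_error(y_true, y_pred))  # 0.25 0.25
print(r2_manual, r2_score(y_true, y_pred))              # 0.8 0.8
print(rmse)                                             # 0.5
```

The square root of the MSE (RMSE) is often easier to interpret because it is in the same units as GDP growth.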

### **5.2 Interpreting the Results**

Look at the values of R-squared and MSE. Discuss with your peers whether these values indicate a good or poor model performance.
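
One way to anchor that discussion is to compare against a trivial baseline that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data in which the target is deliberately unrelated to the features; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = rng.normal(loc=1.5, scale=2.0, size=100)  # unrelated to X on purpose

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The baseline ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_r2 = r2_score(y_te, baseline_pred)
baseline_mse = mean_squared_error(y_te, baseline_pred)
print(f"Baseline R-squared: {baseline_r2}")
print(f"Baseline MSE: {baseline_mse}")
```

A constant prediction never scores above 0 on R-squared, so a model whose test R-squared is near or below 0, or whose MSE is close to the baseline's, has learned little from the features.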

# **Step 6: Improving the Model**

### **6.1 Feature Engineering**

Add new features that may improve the model’s performance. For example, you might add lag values of the unemployment rate to capture trends.

In [None]:
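data['unemployment_rate_lag1'] = data['unemployment_rate'].shift(1)
lagged = data.dropna()  # the first row has no previous quarter, so its lag is NaN
X = lagged[['unemployment_rate', 'inflation_rate', 'unemployment_rate_lag1']]
y = lagged['GDP_growth']  # rebuilt from the same rows so X and y stay aligned


**Explanation:**

This code adds a new feature, the previous row's unemployment rate, which may carry information about recent trends. `shift(1)` assumes the rows are sorted by date. The first row has no predecessor, so it is dropped from both X and y. To see whether the feature helps, repeat the split, training, and evaluation from Steps 4 and 5 with this new X and y.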
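
To make the effect of `shift(1)` concrete, here is a small sketch on a hypothetical four-row frame. It also shows one way to keep features and target aligned once the first row, which has no lagged value, is dropped.

```python
import pandas as pd

# Hypothetical four-quarter frame, already sorted by date
toy = pd.DataFrame({
    "GDP_growth": [0.2, 4.8, -1.6, 1.1],
    "unemployment_rate": [7.4, 4.9, 5.0, 7.7],
})

# Each row receives the previous row's value; the first row gets NaN
toy["unemployment_rate_lag1"] = toy["unemployment_rate"].shift(1)
print(toy)

# Drop the incomplete row once, then take features and target from the same rows
aligned = toy.dropna()
X_lag = aligned[["unemployment_rate", "unemployment_rate_lag1"]]
y_lag = aligned["GDP_growth"]
print(len(X_lag), len(y_lag))
```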

### **6.2 Applying Regularization (Ridge Regression)**

Use Ridge regression to see if it improves the model by preventing overfitting.

In [None]:
from sklearn.linear_model import Ridge
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)


**Explanation:**

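Ridge regression adds a penalty on the squared size of the coefficients, which shrinks them toward zero and keeps the model from becoming too complex. The `alpha` parameter sets the strength of the penalty: larger values shrink the coefficients more.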
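
The shrinking effect of the penalty can be observed directly. This sketch fits ordinary linear regression and Ridge with two illustrative `alpha` values on synthetic data, then compares the overall size (Euclidean norm) of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = (1.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=30))

# Coefficient size without any penalty
ols_norm = np.linalg.norm(LinearRegression().fit(X_demo, y_demo).coef_)

# Coefficient size under a weak and a strong penalty
ridge_norms = {}
for alpha in (1.0, 100.0):
    ridge_norms[alpha] = np.linalg.norm(Ridge(alpha=alpha).fit(X_demo, y_demo).coef_)

print(f"No penalty:  {ols_norm}")
print(f"alpha=1.0:   {ridge_norms[1.0]}")
print(f"alpha=100.0: {ridge_norms[100.0]}")
```

With only two features, as in this exercise, the difference between Ridge and plain linear regression is likely to be small unless `alpha` is large.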

### **6.3 Cross-Validation**

Use cross-validation to validate the model’s robustness.

In [None]:
from sklearn.model_selection import cross_val_score

# Drop incomplete rows once so that X and y stay aligned row for row
clean = data.dropna(subset=['unemployment_rate', 'inflation_rate', 'GDP_growth'])
X = clean[['unemployment_rate', 'inflation_rate']]
y = clean['GDP_growth']

# Initialize the model
model = LinearRegression()

# Apply 5-fold cross-validation (for a regressor the default score is R-squared)
scores = cross_val_score(model, X, y, cv=5)
print(f'Cross-validation scores: {scores}')
print(f'Average Cross-validation score: {scores.mean()}')


**Explanation:**

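Cross-validation splits the data into 5 folds, trains on 4 of them, scores on the remaining one, and repeats until every fold has been used for testing once. For a regressor, `cross_val_score` reports R-squared by default. Scores that vary widely between folds suggest the model is sensitive to which rows it sees.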
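
Because the rows here are dated quarters, one variation worth trying is a time-ordered split, where each fold trains only on earlier rows. The sketch below is an assumption about what might suit this data, not part of the original exercise. It uses scikit-learn's `TimeSeriesSplit` and the `neg_mean_squared_error` scorer on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = (1.0 + 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1]
          + rng.normal(size=100))

# Each fold trains on earlier rows and tests on the rows that follow
ts_cv = TimeSeriesSplit(n_splits=5)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=ts_cv, scoring="neg_mean_squared_error",
)
fold_mse = -neg_mse
print(f"MSE per fold: {fold_mse}")
print(f"Average MSE: {fold_mse.mean()}")
```

This requires the rows to be sorted by date before splitting.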

# **Step 7: Reflection and Discussion**


**Share Results**

Discuss the performance metrics you obtained. Did different features or models (e.g., Ridge) improve the results?

**Identify Challenges**

Talk about any difficulties you encountered. Did you have trouble with missing values or selecting features?

**Applications in Real Life**

Reflect on how such models could be useful in predicting economic growth and aiding decision-making in economics.