# 1. Problem Understanding

We want to predict next-day new COVID-19 cases for a selected country using past case data.

This is a time series regression problem, where:

* Input → new cases from the previous days (lag features)

* Output → new cases on the next day
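
The input/output relationship above can be written compactly. As a sketch, let $y_t$ be the number of new cases on day $t$ and assume three lags (the number used later in this notebook); the symbols $y_t$ and $f$ are notation introduced here, not taken from the dataset. The model then learns a function $f$ such that:

```latex
\hat{y}_t = f\left(y_{t-1},\; y_{t-2},\; y_{t-3}\right)
```

Here $\hat{y}_t$ is the predicted value for day $t$, and $f$ is whatever regressor is fitted to the data.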

# 2. Data Source

Our World in Data – COVID-19 Dataset

# 3. Choose a Country

Example country: India
(You can change it to any other value of the `location` column, such as United States or United Kingdom.)

# 4. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


# 5. Load and Filter the Data

In [13]:
df = pd.read_csv(r"C:\Users\telug\Downloads\owid-covid-data.csv (1).zip")

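# Keep only the rows for the selected country and the two columns we need.
# .copy() gives an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
df = df.loc[df["location"] == "India", ["date", "new_cases"]].copy()

# Parse the date strings into datetime values
df["date"] = pd.to_datetime(df["date"])

# Treat missing daily counts as zero new cases
df["new_cases"] = df["new_cases"].fillna(0)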

df.head()


|  | date | new_cases |
|---|---|---|
| 18696 | 2019-12-31 | 0.0 |
| 18697 | 2020-01-01 | 0.0 |
| 18698 | 2020-01-02 | 0.0 |
| 18699 | 2020-01-03 | 0.0 |
| 18700 | 2020-01-04 | 0.0 |


# 6. Create Lag Features (Past Cases)

Lag features are the case counts from previous days, used as inputs to predict the current day's count.

Example:

* lag_1 → yesterday

* lag_2 → 2 days ago

* lag_3 → 3 days ago
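
To make the lag definitions concrete, here is a minimal sketch on a toy series. The numbers and the `toy` frame are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy series of daily new cases (made-up numbers, for illustration only)
toy = pd.DataFrame({"new_cases": [10, 12, 15, 11, 20]})

# shift(k) moves the values down by k rows, so each row
# sees the value recorded k days earlier
toy["lag_1"] = toy["new_cases"].shift(1)
toy["lag_2"] = toy["new_cases"].shift(2)

print(toy)
```

The first rows contain NaN in the lag columns because no earlier day exists for them, which is why rows with missing values are dropped after the lags are created.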

In [5]:
df["lag_1"] = df["new_cases"].shift(1)
df["lag_2"] = df["new_cases"].shift(2)
df["lag_3"] = df["new_cases"].shift(3)

# Drop rows with NaN values
df.dropna(inplace=True)

df.head()


|  | date | new_cases | lag_1 | lag_2 | lag_3 |
|---|---|---|---|---|---|
| 18699 | 2020-01-03 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18700 | 2020-01-04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18701 | 2020-01-05 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18702 | 2020-01-06 | 0.0 | 0.0 | 0.0 | 0.0 |
| 18703 | 2020-01-07 | 0.0 | 0.0 | 0.0 | 0.0 |


# 7. Define Features and Target

In [6]:
X = df[["lag_1", "lag_2", "lag_3"]]
y = df["new_cases"]


# 8. Train-Test Split (Time Series Friendly)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
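
The split is "time series friendly" because `shuffle=False` keeps the rows in chronological order: the model trains on the earliest 80% of days and is tested on the most recent 20%. A small sketch on toy data (the `days` array is illustrative only) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten consecutive "days", numbered 0..9 (toy data)
days = np.arange(10)

# With shuffle=False the split is a simple cut: earliest rows train, latest rows test
train_days, test_days = train_test_split(days, test_size=0.2, shuffle=False)

print(train_days)  # the first 8 days
print(test_days)   # the last 2 days
```

With the default `shuffle=True`, future days could end up in the training set, so the test score would not reflect how well the model forecasts unseen days.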


# 9. Train Decision Tree Regression Model

In [8]:
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)


# 10. Predict Next-Day New Cases

In [9]:
y_pred = model.predict(X_test)


# 11. Model Evaluation (MAE & RMSE)

In [10]:
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)


Mean Absolute Error (MAE): 24433.56603773585
Root Mean Squared Error (RMSE): 28583.147948754304
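
To show what the two metrics measure, the sketch below computes them by hand on three made-up values (`y_true` and `y_hat` are illustrative, not notebook results). MAE is the average absolute error; RMSE is the square root of the average squared error.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat  # [-10, 10, -30]

# MAE: mean of |error| = (10 + 10 + 30) / 3
mae_manual = np.mean(np.abs(errors))

# RMSE: sqrt of mean squared error = sqrt((100 + 100 + 900) / 3)
rmse_manual = np.sqrt(np.mean(errors ** 2))

print("MAE:", mae_manual)
print("RMSE:", rmse_manual)
```

Because squaring gives large errors more weight, RMSE is never smaller than MAE. That matches the results above, where the RMSE (about 28,583) exceeds the MAE (about 24,434).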


# 12. Predict Tomorrow’s Cases (Single Prediction)

In [12]:
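# Take the lag features from the most recent row.
# Note: these lag columns hold the three days *before* the last date in df,
# so this prediction corresponds to that last date.
last_days = df.iloc[[-1]][["lag_1", "lag_2", "lag_3"]]

# predict() returns an array with one value per input row
tomorrow_cases = model.predict(last_days)
print("Predicted next-day new cases:", int(tomorrow_cases[0]))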


Predicted next-day new cases: 47703
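
To forecast the day after the last date in the data, the feature window has to move forward by one day: the last observed `new_cases` becomes `lag_1`, the old `lag_1` becomes `lag_2`, and the old `lag_2` becomes `lag_3`. The sketch below shows this on a toy frame with made-up numbers; the names `toy` and `next_day_features` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frame with the same column layout as the notebook's df (made-up numbers)
toy = pd.DataFrame({"new_cases": [10.0, 12.0, 15.0, 11.0, 20.0]})
for k in (1, 2, 3):
    toy[f"lag_{k}"] = toy["new_cases"].shift(k)
toy = toy.dropna()

# Most recent row: new_cases=20, lag_1=11, lag_2=15, lag_3=12
last = toy.iloc[-1]

# Slide the window forward one day to build the features for the next day
next_day_features = pd.DataFrame({
    "lag_1": [last["new_cases"]],  # the last observed day
    "lag_2": [last["lag_1"]],      # one day before that
    "lag_3": [last["lag_2"]],      # two days before that
})

print(next_day_features)
```

A frame built this way from the real `df` could be passed to `model.predict` in place of the last row's own lag columns.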


# 13. Conclusion (For Project Report)

Insights:

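* Lag features give the model short-term trend information from the previous three days

* A Decision Tree can model non-linear patterns

* Prediction accuracy depends on how volatile the case counts are; on the held-out 20% of days the model reached an MAE of about 24,434 and an RMSE of about 28,583 cases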

Limitations:

* Sudden outbreaks and policy changes are not captured, because the only inputs are the previous three days of cases

* More lag features or more advanced models (LSTM, ARIMA) may improve accuracy
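
As a sketch of how more lag features could be generated without writing one line per lag, a small helper can build them in a loop. The function `add_lags` and its parameters are hypothetical names introduced here, and the toy data is made up.

```python
import pandas as pd

def add_lags(frame, column, n_lags):
    """Return a copy of `frame` with columns lag_1..lag_<n_lags> for `column`.

    Rows without a complete lag history are dropped.
    """
    out = frame.copy()
    for k in range(1, n_lags + 1):
        out[f"lag_{k}"] = out[column].shift(k)
    return out.dropna()

# Toy example: ten days of made-up counts, seven lags (one week of history)
toy = pd.DataFrame({"new_cases": [float(v) for v in range(10)]})
lagged = add_lags(toy, "new_cases", n_lags=7)

print(lagged.shape)
```

Each extra lag removes one more row from the start of the data, so the number of lags is a trade-off between history per row and the number of usable rows.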