# **Data Prep**

In [1]:
from google.colab import files
uploaded = files.upload()

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("screentime_analysis.csv")
print(df)

Saving screentime_analysis.csv to screentime_analysis (4).csv
           Date        App  Usage (minutes)  Notifications  Times Opened
0    2024-08-07  Instagram               81             24            57
1    2024-08-08  Instagram               90             30            53
2    2024-08-26  Instagram              112             33            17
3    2024-08-22  Instagram               82             11            38
4    2024-08-12  Instagram               59             47            16
..          ...        ...              ...            ...           ...
195  2024-08-10   LinkedIn               22             12             5
196  2024-08-23   LinkedIn                5              7             1
197  2024-08-18   LinkedIn               19              2             5
198  2024-08-26   LinkedIn               21             14             1
199  2024-08-02   LinkedIn               13              4             1

[200 rows x 5 columns]


In [2]:
print(df.isnull().sum())
print(df.duplicated().sum())

Date               0
App                0
Usage (minutes)    0
Notifications      0
Times Opened       0
dtype: int64
0


In [3]:
# convert Date to datetime and extract calendar features
df = pd.read_csv("screentime_analysis.csv")
df['Date'] = pd.to_datetime(df['Date'])
df['DayOfWeek'] = df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['Month'] = df['Date'].dt.month

In [4]:
df = pd.get_dummies(df, columns=['App'], drop_first=True)

In [5]:
# scale numerical features using MinMaxScaler
scaler=MinMaxScaler()
df[['Notifications', 'Times Opened']] = scaler.fit_transform(df[['Notifications', 'Times Opened']])

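# feature engineering
# note: shift(1) takes the value from the preceding row, which is the previous
# day only if rows are sorted by date within each app; the first row becomes NaN
df['Previous_Day_Usage'] = df['Usage (minutes)'].shift(1)
df['Notifications_x_TimesOpened'] = df['Notifications'] * df['Times Opened']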

# save the preprocessed data to a file
df.to_csv('preprocessed_screentime_analysis.csv', index=False)

This step rescales the numerical columns **Notifications** and **Times Opened** to the [0, 1] range with MinMaxScaler so that both are on a comparable scale. Feature engineering then adds a lagged feature (**Previous_Day_Usage**) and an interaction feature (**Notifications_x_TimesOpened**), which are intended to give the model more predictive signal.
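A lag feature only means "previous day" when the rows are ordered by date within each app. The following is a minimal sketch of one way to build such a lag, assuming it runs before the **App** column is one-hot encoded. The `toy` frame and its values are hypothetical, and the lag is really the previous *recorded* day, so gaps in the dates would need separate handling.

```python
import pandas as pd

# Hypothetical toy data: two apps, rows deliberately out of date order.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-08-03", "2024-08-01", "2024-08-02",
                            "2024-08-02", "2024-08-01"]),
    "App": ["Instagram", "Instagram", "Instagram", "LinkedIn", "LinkedIn"],
    "Usage (minutes)": [90, 70, 80, 20, 10],
})

# Sort so that, within each app, consecutive rows are consecutive recorded days.
toy = toy.sort_values(["App", "Date"]).reset_index(drop=True)

# Shift within each app so a lag never crosses from one app into another.
toy["Previous_Day_Usage"] = toy.groupby("App")["Usage (minutes)"].shift(1)

print(toy)
```

The first row of each app has no earlier day, so its lag is NaN. Whether to drop or fill those rows is a modelling choice.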

# **Train The Model**

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [7]:
# split data into features and target variable
X = df.drop(columns=['Usage (minutes)', 'Date'])
y = df['Usage (minutes)']

In [8]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# evaluate the model
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f'Mean Absolute Error: {mae}')

Mean Absolute Error: 15.398500000000002


The **Mean Absolute Error (MAE)** metric is the average absolute difference between the predicted and actual values and is used here to assess performance.

On average, the model's predicted screen time differs from the actual screen time by approximately 15.4 minutes. The model performs reasonably well, but there is still room to reduce this error and make predictions more precise.
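To make the metric concrete, here is a small worked sketch of MAE computed by hand, next to a naive baseline that always predicts the mean. The numbers are hypothetical and not taken from the dataset. In practice the baseline mean would come from `y_train` rather than from the values being scored.

```python
# Hypothetical actual and predicted usage values in minutes (not from the dataset).
actual = [80, 20, 50, 10]
predicted = [70, 35, 45, 30]

# MAE: mean of the absolute differences between prediction and truth.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Naive baseline: always predict the mean of the actual values.
mean_usage = sum(actual) / len(actual)
baseline_mae = sum(abs(a - mean_usage) for a in actual) / len(actual)

print(f"model MAE: {mae:.1f}, baseline MAE: {baseline_mae:.1f}")
# prints "model MAE: 12.5, baseline MAE: 25.0"
```

Comparing the model's MAE against a baseline like this gives a reference point for judging whether an error of a given size is good for the data at hand.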

# **Automating Preprocessing with a Pipeline using Apache Airflow**

In [9]:
pip install apache-airflow



In [12]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from airflow import DAG
from airflow.providers.standard.operators.python import PythonOperator
from datetime import datetime

# define the data preprocessing function
def preprocess_data():
    file_path = 'screentime_analysis.csv'
    df = pd.read_csv(file_path)

    df['Date'] = pd.to_datetime(df['Date'])
    df['DayOfWeek'] = df['Date'].dt.dayofweek
    df['Month'] = df['Date'].dt.month

    df = df.drop(columns=['Date'])

    df = pd.get_dummies(df, columns=['App'], drop_first=True)

    scaler = MinMaxScaler()
    df[['Notifications', 'Times Opened']] = scaler.fit_transform(df[['Notifications', 'Times Opened']])

    preprocessed_path = 'preprocessed_screentime_analysis.csv'
    df.to_csv(preprocessed_path, index=False)
    print(f"Preprocessed data saved to {preprocessed_path}")

# define the DAG
dag = DAG(
    dag_id='data_preprocessing',
    schedule='@daily',  # run once per day
    start_date=datetime(2025, 1, 1),
    catchup=False,
)

# define the task
preprocess_task = PythonOperator(
    task_id='preprocess',
    python_callable=preprocess_data,
    dag=dag,
)
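Before handing the callable to a scheduler, it can help to exercise the transformation on a small in-memory frame. The sketch below uses a hypothetical helper, `preprocess_frame`, that mirrors the steps of `preprocess_data` without file I/O. As an assumption made to keep it dependency-light, min-max scaling is written out by hand in pandas instead of using `MinMaxScaler`. The sample rows are copied from the dataset preview above.

```python
import pandas as pd

def preprocess_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: same steps as preprocess_data, but without file I/O."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"])
    out["DayOfWeek"] = out["Date"].dt.dayofweek
    out["Month"] = out["Date"].dt.month
    out = out.drop(columns=["Date"])
    out = pd.get_dummies(out, columns=["App"], drop_first=True)
    # Min-max scaling written out by hand (assumption: columns are not constant).
    for col in ["Notifications", "Times Opened"]:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out

# Three rows taken from the printed preview of screentime_analysis.csv.
sample = pd.DataFrame({
    "Date": ["2024-08-07", "2024-08-08", "2024-08-10"],
    "App": ["Instagram", "Instagram", "LinkedIn"],
    "Usage (minutes)": [81, 90, 22],
    "Notifications": [24, 30, 12],
    "Times Opened": [57, 53, 5],
})

result = preprocess_frame(sample)
print(result)
```

Keeping the transformation separate from reading and writing files is one possible design. It lets the same function be checked in a notebook and then wrapped by the task callable.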

# **Testing and Running Pipeline**

In [14]:
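# note: this Airflow version responds with the `airflow db` usage text;
# the subcommand list below has `migrate` but no `init`
!airflow db init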

Usage: airflow db [-h] COMMAND ...

Database operations

Positional Arguments:
  COMMAND
    check           Check if the database can be reached
    check-migrations
                    Check if migration have finished
    clean           Purge old records in metastore tables
    downgrade       Downgrade the schema of the metadata database.
    drop-archived   Drop archived tables created through the db clean command
    export-archived
                    Export archived data from the archive tables
    migrate         Migrates the metadata database to the latest version
    reset           Burn down and rebuild the metadata database
    shell           Runs a shell to access the database

Options:
  -h, --help        sho

In [16]:
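# note: this Airflow version printed its usage text instead of starting a server;
# the command list below includes `api-server`
!airflow webserver --port 8080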

Usage: airflow [-h] GROUP_OR_COMMAND ...

Positional Arguments:
  GROUP_OR_COMMAND

    Groups
      assets         Manage assets
      backfill       Manage backfills
      config         View configuration
      connections    Manage connections
      dags           Manage DAGs
      db             Database operations
      jobs           Manage jobs
      pools          Manage pools
      providers      Display providers
      tasks          Manage tasks
      variables      Manage variables

    Commands:
      api-server     Start an Airflow API server instance
      cheat-sheet    Display cheat sheet
      dag-processor  Start a dag processor instance
 

In [18]:
!airflow scheduler

[2025-07-29T20:30:24.120+0000] {providers_manager.py:953} INFO - The hook_class 'airflow.providers.standard.hooks.filesystem.FSHook' is not fully initialized (UI widgets will be missing), because the 'flask_appbuilder' package is not installed, however it is not required for Airflow components to work
[2025-07-29T20:30:24.122+0000] {providers_manager.py:953} INFO - The hook_class 'airflow.providers.standard.hooks.package_index.PackageIndexHook' is not fully initialized (UI widgets will be missing), because the 'flask_appbuilder' package is not installed, however it is not required for Airflow components to work

Please confirm database initialize (or wait 4 seconds to skip it). Are you sure? [y/N]
[2025-07-29T20:30:28.369+0000] {db.py:916} INFO - Log template table does not exist (added in 2.3.0); skipping log template sync.
  ____________       _____________
 ____    |__( )_________  __/__ 