<a href="https://www.kaggle.com/code/sai10py/students-media-addiction-analysis-and-prediction?scriptVersionId=243644320" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1 | Description and Goal
## 1.1 | Dataset Description
The Student Social Media & Relationships dataset records students' social media behaviour alongside related life outcomes.

## 1.2 | Goal
Analyse student behaviour and predict addiction scores from related factors.

## 1.3 | Tasks
- Dataset overview
- Data Analysis
- Prediction Model

# 2 | Importing Initial Dependencies

In [None]:
import pandas as pd # data manipulation and analysis
import numpy as np # numerical computing
import matplotlib.pyplot as plt # static visualizations
import seaborn as sns # statistical graphics

In [None]:
# load dataset
df = pd.read_csv("/kaggle/input/social-media-addiction-vs-relationships/Students Social Media Addiction.csv", index_col = "Student_ID")

In [None]:
# First 5 Rows
df.head()

In [None]:
# Dataset Information
df.info()

In [None]:
# Dataset statistics for Numerical columns
df.describe()

In [None]:
# Dataset statistics for Categorical columns
df.describe(include="object")
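As a complement to `info()` and `describe()`, a compact per-column summary can make missing values and cardinality easier to spot. The sketch below is only illustrative: the helper name `column_overview` and the tiny demo frame are assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing-value count and distinct-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Made-up rows for illustration only, not taken from the dataset
demo_frame = pd.DataFrame({
    "Age": [19, 21, None],
    "Gender": ["Male", "Female", "Female"],
})
overview = column_overview(demo_frame)
print(overview)
```

On the real data, `column_overview(df)` would give the same table for every column.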

# 3 | Understanding Columns

`Student_ID`

Unique respondent identifier for each student.


`Age`

Age of the student (in years).


`Gender`

Categorical column with values "Male" or "Female"


`Academic_Level` 

Current degree of education with values High School / Undergraduate / Graduate


`Country`

Country of residence of respondent


`Avg_Daily_Usage_Hours`

Average hours per day on social media platforms


`Most_Used_Platform`

Most used social media platform. May contain values like Instagram, Facebook, TikTok, etc.


`Affects_Academic_Performance`

Self-reported impact on academics. Answers the question "Is social media affecting academic performance?".


`Sleep_Hours_Per_Night`

Average nightly sleep hours. This is worth analysing because social media use at night may cut into sleep.


`Mental_Health_Score`

Self-rated mental health score of the respondent.


`Relationship_Status`

Single / In Relationship / Complicated


`Conflicts_Over_Social_Media`

Number of relationship conflicts due to social media.


`Addicted_Score`

Social Media Addiction Score (1 = Low to 10 = High)





# 4 | Data Analysis
Analysing columns and their relation with the `Addicted_Score` column

## 4.1 | Checking Gender bias
Check whether the Male and Female counts in the `Gender` column are roughly equal. This helps validate that the survey sample is balanced.

In [None]:
print(df["Gender"].value_counts())
sns.countplot(x = "Gender", data = df)

The data shows no gender bias, so the sample is suitable for further analysis.
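The balance check can also be expressed as proportions rather than raw counts. This is a minimal sketch under stated assumptions: the helper `category_shares` and the demo values are hypothetical and do not reflect the dataset's actual counts.

```python
import pandas as pd

def category_shares(series: pd.Series) -> pd.Series:
    """Return each category's share of the non-null values (shares sum to 1)."""
    return series.value_counts(normalize=True)

# Made-up values for illustration only
demo_gender = pd.Series(["Male", "Female", "Female", "Male", "Female"])
shares = category_shares(demo_gender)
print(shares)

# Gap between the most and least frequent category, as a fraction of respondents
imbalance = shares.max() - shares.min()
print(f"Imbalance: {imbalance:.2f}")
```

A gap close to zero from `category_shares(df["Gender"])` would support the conclusion that the two groups are evenly represented.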

## 4.2 | Count plots for Categorical Columns

In [None]:
# Print value counts and draw a count plot for every categorical column
for col in df.columns:
    if df[col].dtype == "object":
        print(df[col].value_counts())

        sns.countplot(x = col, data = df)
        plt.xticks(rotation = 45) # rotate labels before the figure is rendered
        plt.show()

## 4.3 | Key Observations and Inferences

**Observations**

- Most respondents are *Undergraduate* or *Graduate* students.
- Of the 110 countries represented, India and the USA have the most respondents.
- *Instagram*, *TikTok*, *Facebook*, *WhatsApp* and *Twitter* are the most frequently used platforms, in that order.
- Social media affected academics for nearly one-third of respondents.
- *Single* is the most common relationship status among respondents.

**Inferences**

- The predominance of *Undergraduate* and *Graduate* respondents may indicate that students below undergraduate level either had no access to social media or were not reached by the survey.
- Students from countries such as *India* and the *USA* may be frequent because of better resource accessibility. This may also make them more exposed to social media and its consequences.
- *Instagram* and *TikTok* are the most frequently used platforms. Their short-video (reel) features may attract respondents and affect their performance.
- Attractive social media features can be time-consuming and energy-draining, and may also affect attention span and creativity.
- The combination of single relationship status and high social media usage may indicate a lack of commitment towards academic or personal goals, which could affect academic performance.

## 4.4 | Other Distributions

In [None]:
# Age distribution
# sns.histplot(x = "Age", data = df, kde = True) # alternative view
sns.boxplot(x = "Age", data = df)
plt.show() # plt.show() takes no Axes argument

The minimum and maximum ages of respondents are 18 and 24 respectively, with a median age of 21.
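Values read off a boxplot can be cross-checked numerically. The snippet below is a small sketch; `age_summary` is a hypothetical helper and the demo ages are made up for illustration.

```python
import pandas as pd

def age_summary(ages: pd.Series) -> dict:
    """Return the minimum, median and maximum of a numeric column."""
    return {"min": ages.min(), "median": ages.median(), "max": ages.max()}

# Made-up ages for illustration only
demo_ages = pd.Series([18, 19, 21, 22, 24])
summary = age_summary(demo_ages)
print(summary)
```

Calling `age_summary(df["Age"])` on the real data would confirm the figures quoted from the plot.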

In [None]:
sns.histplot(x = "Avg_Daily_Usage_Hours", data = df, kde = True)

In [None]:
sns.histplot(x = "Sleep_Hours_Per_Night", data = df, kde = True)

**Inferences**

The numerical distributions look reasonably regular and do not need special attention for now.
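A visual impression of the histograms can be backed by a skewness figure per column. This is a rough sketch: the helper `skew_report`, the cut-off of 1.0 (a common rule of thumb, assumed here) and the demo columns are all illustrative choices rather than anything from the original analysis.

```python
import pandas as pd

def skew_report(frame: pd.DataFrame, columns: list, threshold: float = 1.0) -> pd.DataFrame:
    """Report sample skewness per column and flag columns whose |skew| exceeds threshold."""
    skews = frame[columns].skew()
    return pd.DataFrame({"skew": skews, "highly_skewed": skews.abs() > threshold})

# Made-up columns: one symmetric, one with a long right tail
demo_dist = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 1, 10],
})
report = skew_report(demo_dist, ["symmetric", "right_tailed"])
print(report)
```

On the real data this could be applied as `skew_report(df, ["Avg_Daily_Usage_Hours", "Sleep_Hours_Per_Night"])`.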

## 4.5 | Feature Relevance

Before checking correlations with a heatmap, we need to encode the categorical columns.

In [None]:
from sklearn.preprocessing import LabelEncoder

# There is no separate test set, so each column is encoded once here.
# Note: the single encoder is refit per column, so it only retains the last column's mapping.
encoder = LabelEncoder()
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = encoder.fit_transform(df[col])
        
df.head() 
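Refitting one `LabelEncoder` inside the loop means the mapping for every column except the last is lost, so the integer codes cannot be translated back to labels later. A possible variation, sketched below, keeps one encoder per column. The helper `encode_categoricals` and the demo frame are hypothetical, and non-numeric columns are detected with `pd.api.types.is_numeric_dtype` as an assumption about how the categorical columns are stored.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_categoricals(frame: pd.DataFrame):
    """Label-encode every non-numeric column and keep the fitted encoder for each."""
    encoded = frame.copy()  # leave the caller's frame untouched
    encoders = {}
    for col in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            enc = LabelEncoder()
            encoded[col] = enc.fit_transform(encoded[col])
            encoders[col] = enc
    return encoded, encoders

# Made-up rows for illustration only
demo = pd.DataFrame({"Gender": ["Male", "Female", "Female"], "Age": [19, 21, 22]})
encoded_demo, demo_encoders = encode_categoricals(demo)
print(encoded_demo)

# The stored encoder maps integer codes back to the original labels
print(demo_encoders["Gender"].inverse_transform([0, 1]))
```

Label encoding assigns integers in sorted label order, so for nominal columns such as `Country` the numeric order carries no meaning, and correlations computed on those codes should be read with care.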

In [None]:
# Correlation Heatmap
correlation = df.corr()
plt.figure(figsize = (10, 10))
sns.heatmap(correlation, annot = True, fmt = '.2f')

### Observations
- `Age`, `Gender`, `Academic_Level` and `Relationship_Status` are the columns with the least direct relevance to `Addicted_Score`.
- `Country` and `Most_Used_Platform` show low relevance.
- `Avg_Daily_Usage_Hours`, `Affects_Academic_Performance` and `Conflicts_Over_Social_Media` show high direct relevance.
- `Sleep_Hours_Per_Night` and `Mental_Health_Score` show high inverse relevance.

**NOTE**
Many more observations could be made, for example that `Mental_Health_Score` is inversely related to `Conflicts_Over_Social_Media`. Only the most important ones are considered in this version.
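Reading relevance off a heatmap can be supplemented by sorting the target's correlations by absolute value. The following is a sketch under assumptions: `rank_by_target_correlation` is a hypothetical helper, and the demo values are invented purely to show the ordering.

```python
import pandas as pd

def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with the target, strongest (by magnitude) first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)  # keep the sign, order by absolute value

# Made-up rows for illustration only
demo_scores = pd.DataFrame({
    "Age": [20, 19, 22, 18, 21],
    "Avg_Daily_Usage_Hours": [1, 2, 3, 4, 5],
    "Sleep_Hours_Per_Night": [9, 8, 8, 6, 5],
    "Addicted_Score": [2, 4, 6, 8, 10],
})
ranked = rank_by_target_correlation(demo_scores, "Addicted_Score")
print(ranked)
```

Applied to the encoded data as `rank_by_target_correlation(df, "Addicted_Score")`, this lists direct and inverse relationships together, with the sign showing the direction.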

# 5 | Model Training

## 5.1 | Splitting into X and y

In [None]:
X = df.drop(columns = "Addicted_Score")
y = df["Addicted_Score"]

## 5.2 | Splitting into Train and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

## 5.3 | Baseline Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

# Metrics for the baseline model (scikit-learn convention: y_true first, then y_pred)
print("*"*10, " Mean Absolute Error ", "*"*10)
print("Training MAE: ", mean_absolute_error(y_train, train_preds))
print("Test MAE: ", mean_absolute_error(y_test, test_preds))
print()
print("*"*10, " Root Mean Squared Error ", "*"*10)
print("Training RMSE: ", np.sqrt(mean_squared_error(y_train, train_preds)))
print("Test RMSE: ", np.sqrt(mean_squared_error(y_test, test_preds)))
print()
print("*"*10, " R2 Score ", "*"*10)
print("Training R2: ", r2_score(y_train, train_preds)) # R2 is not symmetric in its arguments
print("Test R2: ", r2_score(y_test, test_preds))
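Argument order matters for `r2_score`: scikit-learn's metrics take `y_true` first and `y_pred` second. MAE and MSE are symmetric, but R2 divides by the variance of whatever is passed as `y_true`. The toy arrays below are made up to show the difference and are not model outputs from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 2.0, 3.0, 3.0])

# MAE is symmetric: swapping the arguments gives the same value
mae_correct = mean_absolute_error(y_true, y_pred)
mae_swapped = mean_absolute_error(y_pred, y_true)

# R2 is not symmetric: the denominator uses the variance of the first argument
r2_correct = r2_score(y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)

print("MAE:", mae_correct, mae_swapped)
print("R2 :", r2_correct, r2_swapped)
```

In this toy case the residual sum of squares is 2 either way, but the total sum of squares is 5 for `y_true` and 1 for `y_pred`, so the two R2 values are 0.6 and -1.0.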