# Predicting Students' Dropout and Academic Success

This project predicts student dropout and academic success using several machine learning classifiers. The dataset comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success).

## Project Outline

1. **Download and Load the Data:**
   - Download the dataset and load it into a Pandas DataFrame.

2. **Explore and Preprocess the Data:**
   - Understand the features and target variable.
   - Handle missing values and outliers.
   - Encode categorical variables.
   - Normalize/standardize numerical features.
   
   Please click this link to access:[EDA Notebook](./Predict_Students_Dropout_EDA.ipynb).

3. **Split the Data:**
   - Split the dataset into training and testing sets.

4. **Build Classification Models:**
   - Train multiple classification models such as Logistic Regression, Random Forest, and SVM.
   - Evaluate their performance using metrics like accuracy, precision, recall, and F1-score.

5. **Feature Selection and Hyperparameter Tuning:**
   - Perform feature selection to identify important features.
   - Tune hyperparameters using Grid Search or Random Search.

6. **Evaluate and Compare Models:**
   - Compare the models based on their performance metrics.
   - Choose the best-performing model.

7. **Model Interpretation and Insights:**
   - Interpret the model to understand which features are most influential.
   - Provide insights and recommendations based on the findings.
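
Steps 3 to 6 of the outline can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: it runs on synthetic three-class data from `make_classification` instead of the real dataset, and the model names, hyperparameter grids, and the choice of macro F1 as the selection metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# Step 3: stratified train/test split so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 4-5: candidate models with small, illustrative hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"model__C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=42), {"model__n_estimators": [50, 100]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each CV fold.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1_macro")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    results[name] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    }

# Step 6: pick the model with the highest macro F1 on the held-out set.
best_name = max(results, key=lambda n: results[n]["f1_macro"])
print(best_name, results[best_name])
```

Keeping the scaler inside the `Pipeline` means the grid search never sees statistics computed from validation folds, which is one common way to avoid leakage during tuning.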

In [29]:
# Install the packages this project needs: ucimlrepo fetches the dataset
# from the UCI Machine Learning Repository; the others are extra dependencies.

#%pip install ucimlrepo
#%pip install xgboost
#%pip install cmake
#%pip install catboost

In [1]:
# Import required packages

from ucimlrepo import fetch_ucirepo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Preprocessing, feature selection, and pipelines
from sklearn.preprocessing import (
    LabelEncoder, MinMaxScaler, RobustScaler, StandardScaler, label_binarize,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.utils import resample

# Model selection and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
)

In [5]:
# Load the dataset
df = pd.read_csv('data/prep_data.csv')
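
Once the data is loaded, a quick sanity check and a label-encoded, stratified split might look like the sketch below. It uses a small hypothetical DataFrame named `toy` rather than `df`, because the column names in `data/prep_data.csv` are not shown here; the `Target` column and the `feature_a`/`feature_b` names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the loaded DataFrame; real column names may differ.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "feature_b": [0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Target": ["Dropout", "Enrolled", "Graduate"] * 3,
})

# Basic checks: size, missing values per column, and class balance.
print(toy.shape)
print(toy.isna().sum())
print(toy["Target"].value_counts())

# Encode the string labels as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
y = encoder.fit_transform(toy["Target"])
X = toy.drop(columns="Target")

# Stratified split: here 3 test rows, one per class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Keeping the fitted `encoder` around makes it possible to map predictions back to the original labels with `encoder.inverse_transform`.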