# Predicting Students' Dropout and Academic Success

This project aims to predict student dropout rates and academic success using various machine learning techniques. The dataset used for this analysis is sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success).

## Project Outline

1. **Download and Load the Data:**
   - Download the dataset and load it into a Pandas DataFrame.

2. **Explore and Preprocess the Data:**
   - Understand the features and target variable.
   - Handle missing values and outliers.
   - Encode categorical variables.
   - Normalize/standardize numerical features.
   
   The exploration and preprocessing steps are in the [EDA Notebook](./Predict_Students_Dropout_EDA.ipynb).

3. **Split the Data:**
   - Split the dataset into training and testing sets.

4. **Build Classification Models:**
   - Train multiple classification models such as Logistic Regression, Random Forest, and SVM.
   - Evaluate their performance using metrics like accuracy, precision, recall, and F1-score.

5. **Feature Selection and Hyperparameter Tuning:**
   - Perform feature selection to identify important features.
   - Tune hyperparameters using Grid Search or Random Search.

6. **Evaluate and Compare Models:**
   - Compare the models based on their performance metrics.
   - Choose the best-performing model.

7. **Model Interpretation and Insights:**
   - Interpret the model to understand which features are most influential.
   - Provide insights and recommendations based on the findings.
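
The outline above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's actual code: it uses synthetic three-class data from `make_classification` as a stand-in for the UCI dataset, and the class names, the 80/20 split, and the small parameter grid are choices made only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption): 600 rows, 10 features, 3 classes.
X, y_codes = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
labels = np.array(["Dropout", "Enrolled", "Graduate"])[y_codes]

# Step 2: encode the string target as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Step 3: hold out 20% for testing, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: train two candidate models; scaling lives inside the pipeline
# so it is fitted on the training data only.
models = {
    "logistic_regression": Pipeline(
        [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
    ),
    "random_forest": Pipeline(
        [("clf", RandomForestClassifier(n_estimators=100, random_state=42))]
    ),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "macro_f1": f1_score(y_test, pred, average="macro"),
    }

# Step 5: tune the regularization strength with a small, illustrative grid.
search = GridSearchCV(
    models["logistic_regression"],
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Step 6: pick the model with the best macro F1 on the test split.
best_name = max(scores, key=lambda n: scores[n]["macro_f1"])

# Step 7: tree-based models expose per-feature importances for interpretation.
importances = models["random_forest"].named_steps["clf"].feature_importances_

print(scores)
print(best_name, search.best_params_)
```

Macro-averaged F1 is used here because it weights the three classes equally; a different metric may be more suitable depending on which errors matter most.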

In [29]:
# Install the packages this project needs; ucimlrepo fetches the dataset
# from the UCI Machine Learning Repository. Uncomment the lines below to run.

#%pip install ucimlrepo
#%pip install Faker
#%pip install xgboost
#%pip install cmake
#%pip install catboost

In [6]:
# Import required packages

from ucimlrepo import fetch_ucirepo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Preprocessing, feature selection, and model selection
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
    label_binarize,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.utils import resample

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
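
The `ValueError` above is typically raised when a compiled extension (pandas, scikit-learn, and similar) was built against a different NumPy version than the one currently installed. As a first diagnostic, the sketch below lists installed versions using only the standard library. The helper name `installed_versions` is hypothetical, and the package list is an assumption based on the imports in this cell.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


# Packages imported above that ship compiled extensions (assumed list).
report = installed_versions(["numpy", "pandas", "scikit-learn", "xgboost", "catboost"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If the versions look mismatched, reinstalling or upgrading the affected package so that it matches the installed NumPy usually resolves the error.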

In [4]:
# Load the previously cleaned data (assumed intent; pandas has no top-level `to_csv`)
df = pd.read_csv('data/cleaned_data.csv')

In [5]:
# Combine features and target into a single DataFrame
df = pd.concat([X, y], axis=1)

# Show the first rows
df.head()

# List the column names
df.columns

Index(['Marital Status', 'Application mode', 'Application order', 'Course',
       'Daytime/evening attendance', 'Previous qualification',
       'Previous qualification (grade)', 'Nacionality',
       "Mother's qualification", "Father's qualification",
       "Mother's occupation", "Father's occupation", 'Admission grade',
       'Displaced', 'Educational special needs', 'Debtor',
       'Tuition fees up to date', 'Gender', 'Scholarship holder',
       'Age at enrollment', 'International',
       'Curricular units 1st sem (credited)',
       'Curricular units 1st sem (enrolled)',
       'Curricular units 1st sem (evaluations)',
       'Curricular units 1st sem (approved)',
       'Curricular units 1st sem (grade)',
       'Curricular units 1st sem (without evaluations)',
       'Curricular units 2nd sem (credited)',
       'Curricular units 2nd sem (enrolled)',
       'Curricular units 2nd sem (evaluations)',
       'Curricular units 2nd sem (approved)',
       'Curricular units 2nd s

    age                                            address  \
0  27.0  47403 Emily Spring Suite 022\nLake Patricia, F...   
1  20.0   752 Nicole Circle Suite 705\nNew Angel, TX 44916   
2  24.0  18737 Moore Viaduct Apt. 584\nHolmesstad, IN 9...   
3  26.0       73666 Diaz Court\nWest Isaiahburgh, TX 53858   
4  22.0          42879 Rebecca Falls\nTravisport, VA 28500   

                        email             name  grade  
0      mccoyfaith@example.com     Kevin Horton   62.0  
1          sewing@example.org       Riley Hall   48.0  
2        nicole23@example.com       Logan Hahn   89.0  
3   jacksonrachel@example.com  Jennifer Martin   38.0  
4  kathleenmiller@example.net              NaN   83.0  
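
The cell above uses `X` and `y` without showing where they come from. One plausible source, given the `fetch_ucirepo` import, is the `ucimlrepo` package. The sketch below assumes dataset id 697 (taken from the repository URL at the top of this document) and network access for the fetch. The helper names `load_uci_students` and `combine_features_and_target`, the `Target` column name, and the demo values are illustrative assumptions.

```python
import pandas as pd


def load_uci_students(dataset_id=697):
    """Fetch features and target from the UCI repository (needs network access)."""
    from ucimlrepo import fetch_ucirepo  # imported lazily so the demo below works offline

    dataset = fetch_ucirepo(id=dataset_id)
    return dataset.data.features, dataset.data.targets


def combine_features_and_target(X, y):
    """Join features and target column-wise, mirroring pd.concat([X, y], axis=1)."""
    return pd.concat([X, y], axis=1)


# Offline demonstration with a tiny made-up frame (values are illustrative only).
X_demo = pd.DataFrame({"Admission grade": [127.3, 142.5], "Debtor": [0, 1]})
y_demo = pd.DataFrame({"Target": ["Graduate", "Dropout"]})
df_demo = combine_features_and_target(X_demo, y_demo)
print(df_demo.shape)  # (2, 3)
```

With network access, `X, y = load_uci_students()` followed by `combine_features_and_target(X, y)` would produce a combined DataFrame in the same way.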


In [27]:
# Checking for nulls
df.isnull().sum().sum()

0
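
`df.isnull().sum().sum()` collapses the whole DataFrame into a single count. When that total is nonzero, a per-column breakdown is more useful for deciding how to handle the gaps. The sketch below runs on a toy DataFrame whose columns and values are invented for illustration; median imputation and a placeholder string are one possible strategy, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (illustrative only).
toy = pd.DataFrame(
    {
        "age": [27.0, np.nan, 24.0, 26.0],
        "grade": [62.0, 48.0, np.nan, 38.0],
        "name": ["A", "B", None, "D"],
    }
)

total_missing = int(toy.isnull().sum().sum())  # single number, as in the cell above
per_column = toy.isnull().sum()                # missing count for each column

# Fill numeric gaps with each column's median; fill missing names with a placeholder.
filled = toy.copy()
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
filled["name"] = filled["name"].fillna("unknown")

print(per_column)
print(int(filled.isnull().sum().sum()))
```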