This is my first upload on machine learning. It is a very basic classification problem: I want to predict a student's gender from his or her test scores.

### Imports


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Load Data

In [None]:
students=pd.read_csv("../input/StudentsPerformance.csv")

### Understand Data

In [None]:
students.head(5)

In [None]:
students.info()

In [None]:
students.describe()

#### Checking out target numbers

The two classes are quite balanced.

In [None]:
students.gender.value_counts()

#### There are no missing values
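
The `students.info()` output above already reports the non-null counts. A more direct check could look like the sketch below. The `toy` frame is a made-up stand-in for `students` so the snippet runs on its own; the column names are assumed to match the dataset.

```python
import pandas as pd

# Made-up stand-in for the real `students` frame (values are illustrative only).
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 69, 90, 47],
    "reading score": [72, 90, 95, 57],
    "writing score": [74, 88, 93, 44],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

On the real data, the same two calls would be made on `students` instead of `toy`.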

-----

----

## Exploratory Data Analysis

In [None]:
sns.set_context("notebook", font_scale=1.5)
sns.pairplot(students, hue="gender", height=3.5, palette='husl', diag_kind="kde",
             plot_kws=dict(s=30, linewidth=0.2))

#### First Assumptions

The pairplot suggests that male students tend to do better in math, while female students tend to do better in both reading and writing. Since the two groups differ on the scores, our model should have some chance of succeeding with the score data alone.
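
One way to put numbers on this impression is to compare the mean score per gender. This is only a sketch: the `toy` frame below is made up to mirror the pattern described above and does not contain real values from the dataset.

```python
import pandas as pd

# Made-up rows that mimic the pattern seen in the pairplot (not real data).
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "math score": [60, 70, 65, 75, 58, 72],
    "reading score": [75, 64, 80, 66, 72, 60],
    "writing score": [76, 60, 82, 63, 74, 58],
})

score_cols = ["math score", "reading score", "writing score"]

# Mean of every score column, one row per gender.
mean_by_gender = toy.groupby("gender")[score_cols].mean()
print(mean_by_gender)

# Positive gap: males score higher on average; negative gap: females do.
gap = mean_by_gender.loc["male"] - mean_by_gender.loc["female"]
print(gap)
```

Running the same `groupby` on `students` would show how large the differences really are.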

Let's take another look at the reading and writing scores against the math score. We expect to see lower scores for math.

In [None]:
#darkgrid, whitegrid, dark, white, ticks
sns.set_style("darkgrid")
#paper, notebook, talk, poster
sns.set_context("notebook", font_scale=1.5)
plt.figure(figsize=(12,8))
p1=sns.kdeplot(students['reading score'], shade=True, color="teal", bw=.9)
p1=sns.kdeplot(students['math score'], shade=True, color="lightslategray", bw=.9)
plt.show()

In [None]:
#darkgrid, whitegrid, dark, white, ticks
sns.set_style("darkgrid")
#paper, notebook, talk, poster
sns.set_context("notebook", font_scale=1.5)
plt.figure(figsize=(12,8))
p1=sns.kdeplot(students['writing score'], shade=True, color="indianred", bw=.9)
p1=sns.kdeplot(students['math score'], shade=True, color="lightslategray", bw=.9)
plt.show()

Lastly, let's compare reading vs math and writing vs math scores with gender as hue. The pairplot already shows the same thing, but I want to look at it a little closer this time to make it clearer.

In [None]:
#darkgrid, whitegrid, dark, white, ticks
sns.set_style("darkgrid")
#paper, notebook, talk, poster
sns.set_context("notebook", font_scale=1.5)
ax = sns.lmplot( x="reading score", y="math score", data=students, fit_reg=False, hue='gender', height=9,
           palette="Set2", aspect=1.3, legend=False,
           scatter_kws={"alpha":0.5,"s":50})
#ax.set(xlim=(0,800))
#ax.set(ylim=(0,300))
plt.title("Reading vs Math")
plt.legend(loc='lower right')
plt.xlabel("Reading Score")
plt.ylabel("Math Score")
plt.show()

In [None]:
#darkgrid, whitegrid, dark, white, ticks
sns.set_style("darkgrid")
#paper, notebook, talk, poster
sns.set_context("notebook", font_scale=1.5)
ax = sns.lmplot( x="writing score", y="math score", data=students, fit_reg=False, hue='gender', height=9,
           palette="Set2", aspect=1.3, legend=False,
           scatter_kws={"alpha":0.5,"s":50})
#ax.set(xlim=(0,800))
#ax.set(ylim=(0,300))
plt.title("Writing vs Math")
plt.legend(loc='lower right')
plt.xlabel("Writing Score")
plt.ylabel("Math Score")
plt.show()

-----


----

## Prepare Data

In [None]:
students.head(3)

I am going to drop the categorical columns and focus on their scores.

In [None]:
# Same as dropping the four categorical columns by name:
# students.drop(['race/ethnicity','parental level of education','lunch','test preparation course'], axis=1, inplace=True)
students.drop(students.columns[1:5], axis=1, inplace=True)

In [None]:
students.head(3)

Encode the gender categorical variable as numbers (female = 1, male = 0).

In [None]:
students['gender'] = students['gender'].map({'female':1, 'male': 0})
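
A small caveat with `map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. That also means running the cell above a second time would turn the whole column into `NaN`, because the column then holds 1 and 0 rather than the original strings. The sketch below illustrates both cases on a made-up series; the mis-cased `"Female"` entry is deliberate.

```python
import pandas as pd

# Made-up labels; "Female" is intentionally mis-cased to show an unmapped value.
gender = pd.Series(["female", "male", "Female", "male"])
mapping = {"female": 1, "male": 0}

encoded = gender.map(mapping)
unmapped = int(encoded.isnull().sum())
print("Labels the mapping did not cover:", unmapped)

# Applying the same mapping again finds no matching keys at all.
rerun = encoded.map(mapping)
print("All NaN after a second run:", bool(rerun.isnull().all()))
```

Checking `students['gender'].isnull().sum()` right after the mapping is a cheap way to catch either problem.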

Take a look at the correlations

In [None]:
students.corr()

----

----

## Machine Learning

Selecting X and y

In [None]:
X = students.drop(columns=['gender'])
y = students['gender']

I am not really sure whether this data needs rescaling. Tree-based models like the ones used below are generally insensitive to feature scaling, but I am doing it anyway just to be sure.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
print(rescaledX[0:5,:])

#### Split Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
test_size = 0.33
seed = 7

In [None]:
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=test_size,
random_state=seed)
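
A variation on the scaling and splitting steps above: the scaler there is fitted on all rows before the split, so the test rows influence the scaling. A common alternative, sketched here on random stand-in data, is to split first and fit the scaler on the training rows only. The arrays `X_toy` and `y_toy` are hypothetical and only exist to make the snippet runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data: 30 rows, 3 score-like columns, binary target.
rng = np.random.RandomState(7)
X_toy = rng.randint(0, 101, size=(30, 3)).astype(float)
y_toy = rng.randint(0, 2, size=30)

# Split first, using the same test_size and seed as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=7)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_tr = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler_tr.fit_transform(X_tr)
X_te_scaled = scaler_tr.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside the 0 to 1 range, which is expected, since the minimum and maximum come from the training rows alone.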

---

#### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier(random_state=0, max_depth=6)
tree.fit(X_train, y_train)

In [None]:
print('Training set accuracy is {}'.format(tree.score(X_train, y_train)))
print('Testing set accuracy is {}'.format(tree.score(X_test, y_test)))

----

#### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest = RandomForestClassifier(n_estimators=7, random_state=5)
forest.fit(X_train, y_train)

In [None]:
print('Training set accuracy is {}'.format(forest.score(X_train, y_train)))
print('Testing set accuracy is {}'.format(forest.score(X_test, y_test)))

----

#### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
gbrt = GradientBoostingClassifier(random_state=0, max_depth=5)
gbrt.fit(X_train, y_train)

In [None]:
print('Training set accuracy is {}'.format(gbrt.score(X_train, y_train)))
print('Testing set accuracy is {}'.format(gbrt.score(X_test, y_test)))
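
The three models above are each scored on a single train/test split, so the numbers depend on that particular split. A possible complement is k-fold cross-validation with `cross_val_score`, sketched below with the same hyperparameters. The data here is synthetic (`X_toy`, `y_toy` are made up), so the printed accuracies say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 120 rows, 3 score-like columns.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 100, size=(120, 3))
# Made-up target loosely tied to column 0 minus column 2, plus noise.
y_toy = (X_toy[:, 0] - X_toy[:, 2] + rng.normal(0, 10, size=120) > 0).astype(int)

# Same model settings as in the cells above.
models = {
    "tree": DecisionTreeClassifier(random_state=0, max_depth=6),
    "forest": RandomForestClassifier(n_estimators=7, random_state=5),
    "gbrt": GradientBoostingClassifier(random_state=0, max_depth=5),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation; the default score for classifiers is accuracy.
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    cv_means[name] = scores.mean()
    print("{}: mean accuracy {:.3f} (std {:.3f})".format(
        name, scores.mean(), scores.std()))
```

On the notebook's data, `X_toy` and `y_toy` would be replaced by the feature matrix and `y`.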

-----

----

### Feature Engineering

I am creating some new columns in our data to try to get better results. The first one is a math-strong column: students whose math score is above the average are labeled strong, and all other students are labeled weak.

In [None]:
threshold = students['math score'].mean()
threshold

In [None]:
students["math-strong"] = ["strong" if i > threshold else "weak" for i in students['math score']]
students.head(3)

In [None]:
students['math-strong'] = students['math-strong'].map({'strong':1, 'weak': 0})
students.head(3)

Taking a look at the correlation table. I don't really think this column is so important, but I am going to leave it in for the time being.

In [None]:
students.corr()

I am creating one more column, labeled better or worse. Students who have a higher score in math than in writing are labeled better. All other students are labeled worse.

In [None]:
students["borw"] = students["math score"] - students["writing score"]
students["borw"] = ["better" if i > 0 else "worse" for i in students["borw"]]
students.head(3)

In [None]:
students['borw'] = students['borw'].map({'better':1, 'worse': 0})
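
Both flag columns above are built in two steps: a list comprehension that produces strings, then a `map` to 0 and 1. The same result can be reached in one step with a boolean comparison cast to integers. This is a sketch on a made-up four-row frame, not a replacement for the cells above. In both cases a tie falls into the 0 group, matching the `else` branches of the list comprehensions.

```python
import pandas as pd

# Made-up scores, chosen so that ties and both outcomes appear.
toy = pd.DataFrame({
    "math score": [40, 66, 90, 66],
    "writing score": [55, 60, 90, 70],
})

# 1 if the math score is above the mean math score, else 0.
threshold = toy["math score"].mean()
toy["math-strong"] = (toy["math score"] > threshold).astype(int)

# 1 if the math score is higher than the writing score, else 0.
toy["borw"] = (toy["math score"] > toy["writing score"]).astype(int)

print(threshold)
print(toy)
```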

In [None]:
students.corr()

This seems promising. Let's try the Decision Tree once more to see if we are doing better.

In [None]:
X = students.drop(columns=['gender'])
y = students['gender']

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
print(rescaledX[0:5,:])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=test_size,
random_state=seed)

In [None]:
new_tree = DecisionTreeClassifier(random_state=0, max_depth=7)
new_tree.fit(X_train, y_train)
print('Training set accuracy is {}'.format(new_tree.score(X_train, y_train)))
print('Testing set accuracy is {}'.format(new_tree.score(X_test, y_test)))

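We are getting 0.05% better results than before. Keep in mind that this tree uses `max_depth=7` while the first one used `max_depth=6`, so the two runs are not strictly comparable.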
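
One way to check whether the new columns themselves help is to hold the tree settings fixed and compare the two feature sets under cross-validation. The sketch below does that on synthetic data: every value in `toy`, including the target, is made up, so only the procedure carries over, not the numbers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared `students` frame.
rng = np.random.RandomState(7)
n = 200
toy = pd.DataFrame({
    "math score": rng.randint(30, 101, size=n),
    "reading score": rng.randint(30, 101, size=n),
    "writing score": rng.randint(30, 101, size=n),
})
# Made-up target tied to writing minus math, plus noise.
noise = rng.normal(0, 15, size=n)
toy["gender"] = (toy["writing score"] - toy["math score"] + noise > 0).astype(int)

# Engineered columns, built the same way as in the notebook.
toy["math-strong"] = (toy["math score"] > toy["math score"].mean()).astype(int)
toy["borw"] = (toy["math score"] > toy["writing score"]).astype(int)

base_cols = ["math score", "reading score", "writing score"]
all_cols = base_cols + ["math-strong", "borw"]

results = {}
for label, cols in [("scores only", base_cols), ("with new features", all_cols)]:
    # Identical tree settings for both runs, so only the features differ.
    clf = DecisionTreeClassifier(random_state=0, max_depth=6)
    results[label] = cross_val_score(clf, toy[cols], toy["gender"], cv=5).mean()
    print("{}: mean accuracy {:.3f}".format(label, results[label]))
```

If the two means are close on the real data, the extra columns are probably not adding much beyond what the raw scores already carry.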