# Objective
The purpose of this kernel is to propose a machine learning based approach for grading students in any course. <br> 
Let's first look at the techniques that universities currently use to grade students. <br> 
### I. Absolute Grading System
This is one of the most common forms of grading, where the grading criteria are set before students start writing the paper. <br> 
Pros: This sets a standard for excellence and for failing, and suits courses whose examinations have a predefined difficulty. <br>
Cons: This greatly reduces the flexibility of the examination. Examiners are compelled to set a predefined level of difficulty, which is by nature a subjective concept. A slightly harder question could result in unfair grades for the whole batch. <br> 
One example could be:
1. Marks above 90 --> A
2. Marks of 70 or more, up to 90 --> B
3. Marks of 30 or more, below 70 --> C
<br>
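The fixed cut-offs above can be sketched as a small function. This is only an illustration: the function name, the sample batch and the `F` grade for marks below 30 are assumptions, since the example in the text does not say what happens below 30.

```python
def absolute_grade(marks):
    """Map a mark out of 100 to a letter using fixed cut-offs."""
    if marks > 90:
        return "A"
    if marks >= 70:
        return "B"
    if marks >= 30:
        return "C"
    return "F"  # assumption: the text does not name a grade below 30

# Hypothetical batch of marks, used only to show the mapping
batch = [95, 90, 72, 45, 12]
grades = [absolute_grade(m) for m in batch]
print(grades)
```

Note that the grade of each student depends only on their own mark, never on how the rest of the batch performed.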

### II. Relative Grading system
This is commonly adopted in universities. According to this system, the mapping from marks to grades is decided by the collective performance of a particular batch.<br> One example could be:
1. Assign top 10% students --> A
2. Assign the next 20% students --> B
3. Assign the next 40% students --> C
4. Assign the remaining 30% --> D

<br> 
This method brings in competition among students. Whether that is desirable is a topic for another discussion, but this method certainly gives some students a low grade regardless of their absolute marks. <br>
Pros: Difficulty of the paper no longer plays a role in the grade assignment. <br>
Cons: Even though marks are awarded on merit, this method does not guarantee that grades reflect that merit, because a grade depends on a student's rank in the batch rather than on the marks themselves. <br>
<br>
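A minimal sketch of the percentage-based scheme above, assuming the 10/20/40/30 split from the example. The function name, the rounding rule and the tie-breaking by position are illustrative choices, not part of any standard.

```python
def relative_grades(marks, shares=(("A", 0.10), ("B", 0.20), ("C", 0.40), ("D", 0.30))):
    """Grade by rank: each grade takes the next share of students, best first."""
    n = len(marks)
    # Student indices ordered from highest to lowest mark (ties keep their original order)
    order = sorted(range(n), key=lambda i: marks[i], reverse=True)
    grades = [None] * n
    start = 0
    cumulative = 0.0
    for grade, share in shares:
        cumulative += share
        end = round(cumulative * n)
        for i in order[start:end]:
            grades[i] = grade
        start = end
    # Anyone left over because of rounding falls into the last grade
    for i in order[start:]:
        grades[i] = shares[-1][0]
    return grades

# Hypothetical batch of ten students
marks = [1, 1, 2, 2, 3, 4, 5, 5, 6, 10]
grades = relative_grades(marks)
for m, g in zip(marks, grades):
    print(m, g)
```

With this batch, the two students who scored 5 end up on different sides of the B/C boundary, which illustrates the drawback listed under Cons.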

### III. Applying ML based clustering for sound grading
Let's consider a distribution {1,1,2,2,3,4,5,5,6,10}. If we assign the top 20% of the students a grade of A, the students scoring 10 and 6 share that grade. This wouldn't be fair to the 8th student, who scored 5: it is more logical for 6 to fall into a grade with the 4s and 5s than with 10. <br> 
Now we can start forming the problem statement: For a student to feel that the grades are fair, other students having the same grade should have marks as close as possible to that student's. Hence the problem becomes one of minimizing intra-cluster (within-grade) distances.
<br>
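To make this concrete, the sketch below compares two ways of splitting the example distribution into three grades and measures how tightly each grade is packed. The three-grade split, the 20%/30%/50% shares and the helper name are assumptions made for illustration.

```python
def within_group_ss(groups):
    """Sum of squared distances from every mark to the mean of its own group."""
    total = 0.0
    for group in groups:
        mean = sum(group) / len(group)
        total += sum((m - mean) ** 2 for m in group)
    return total

# Percentile rule (assumed shares): top 20% get A, next 30% get B, the rest get C
percentile_split = [[10, 6], [5, 5, 4], [3, 2, 2, 1, 1]]
# Distance-based alternative: 10 stands alone and 6 joins the 4s and 5s
distance_split = [[10], [6, 5, 5, 4], [3, 2, 2, 1, 1]]

loss_percentile = within_group_ss(percentile_split)
loss_distance = within_group_ss(distance_split)
print(loss_percentile, loss_distance)
```

The distance-based split gives the smaller total, which matches the intuition that 6 belongs with the 4s and 5s.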

![Image](https://i.ytimg.com/vi/fGkGRoiBtKg/hqdefault.jpg)
<br>

The image above is a rough representation of data on a line separated into different clusters. For each set of marks, we effectively want to minimize this cost function
<br>

![Image](https://www.saedsayad.com/images/Clustering_kmeans_c.png)<br>
(img source: https://www.saedsayad.com/)
<br>
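For reference, the standard K-Means objective can be written as below. This is the textbook form, assumed here to match the image above, with $k$ the number of grades, $C_j$ the set of marks assigned to grade $j$, and $\mu_j$ their mean (symbols chosen for illustration).

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

For one-dimensional marks the norm is simply the absolute difference, so $J$ is the total squared gap between each mark and the average mark of its grade.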
The K-Means algorithm matches this problem statement exactly. Here is how it works:
1. We select the number of clusters k (the number of grades we want to allot) and pick k initial centroids
2. We assign each data point to the closest centroid
3. We recompute each centroid as the mean of its newly formed cluster
4. We repeat 2 and 3 until the centroids move by less than a desired tolerance
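The steps above can be sketched in plain Python for one-dimensional marks. This is a teaching sketch, not the implementation used later in the notebook (which relies on scikit-learn's `KMeans`). The evenly spaced starting centroids, the function name and the sample marks are assumptions.

```python
def kmeans_1d(marks, k, n_iter=100, tol=1e-4):
    # Step 1: spread k starting centroids evenly over the range of marks
    # (an illustrative choice; libraries usually use random or k-means++ starts)
    lo, hi = min(marks), max(marks)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(n_iter):
        # Step 2: assign each mark to its closest centroid
        labels = [min(range(k), key=lambda j: abs(m - centroids[j])) for m in marks]
        # Step 3: move each centroid to the mean of its cluster
        new_centroids = []
        for j in range(k):
            members = [m for m, lab in zip(marks, labels) if lab == j]
            # An empty cluster keeps its previous centroid
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # Step 4: stop once no centroid moves by more than the tolerance
        shift = max(abs(a - b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Hypothetical marks for a small class
marks = [3, 3, 4, 9, 9, 10]
labels, centroids = kmeans_1d(marks, k=2)
print(labels, centroids)
```

The result depends on the starting centroids, which is why library implementations typically run several initialisations and keep the best one.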

### IV. Removing the outliers using DBSCAN
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an excellent algorithm when we need to detect outliers in spatial data. 
If we consider a two-dimensional space, DBSCAN groups together points based on a distance threshold (usually Euclidean distance) and a minimum density (the number of points within a circle of that radius, or a hypersphere for multi-dimensional data). 
It classifies points in regions whose density falls below the threshold as noise. Here is a depiction:

![Image](https://3.bp.blogspot.com/-rDYuyg00Z0w/WXA-OQpkAfI/AAAAAAAAI_I/QshfNVNHD_wXJwXEipRIVzDSX5iOEAy2wCEwYBhgL/s1600/DBSCAN_Points.PNG)
<br> 
If a data point doesn't have the minimum number of samples within the set distance threshold, and is not within that distance of a point that does, it is considered an outlier. These two values are hyperparameters of DBSCAN. A minimum sample count of 4 and a distance threshold of 4 were found to work in most cases; however, you can tune them as per your need. 
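To show how the two hyperparameters interact, here is a small pure-Python sketch of DBSCAN for one-dimensional marks. It is written for illustration only; the notebook itself uses `sklearn.cluster.DBSCAN`. The function name and the sample marks are assumptions, and a point is counted as part of its own neighbourhood.

```python
def dbscan_1d(values, eps, min_samples):
    """Return a cluster id per value, or -1 for noise."""
    n = len(values)
    # Neighbourhood of each point: every point within eps, including itself
    neighbours = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                  for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outwards from this unlabelled core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    # Only core points keep expanding the cluster
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels

# Hypothetical marks on a 0-100 scale with one isolated low score
marks = [5, 40, 42, 43, 45, 47, 80, 82, 83, 85]
labels = dbscan_1d(marks, eps=4, min_samples=4)
outliers = [m for m, lab in zip(marks, labels) if lab == -1]
print(labels, outliers)
```

Here only the isolated mark is flagged as noise, while marks such as 40 and 47 are kept as border points because they lie within `eps` of a dense point.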

**Recommended Number of grades:**
It may make more sense to assign fewer grades to a class than the maximum available. Imagine a distribution of {3,3,4,9,9,10}: we could segregate it into 4-5 grades, but to the human eye it makes more sense to assign 2 grades instead of 4. How does our model handle such cases? 
<br> 

To tackle this issue, we use the elbow method to figure out a suitable number of clusters for a distribution. The intuition here is that, as we increase the number of clusters, J (the cost function) naturally reduces and becomes 0 when the number of clusters equals the number of data points. We increase the number of clusters only when doing so results in a significant decline in J compared to the previous k values. We follow these steps:
1. Run the algorithm for different values of k
2. For each k, calculate the total loss
3. Plot total loss against the number of clusters 
4. The value of k at the most significant bend is considered the best fit for clustering
<br> 
The plot would look something like this:
<br>

![Image](https://miro.medium.com/max/832/1*8wV1j-klQA1xFvfaNXuVzg.png)<br>
(img source: kdnuggets)
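The elbow idea can be checked by hand on the {3,3,4,9,9,10} example. The sketch below is an illustration under one stated substitution: instead of K-Means it searches every split of the sorted marks into k contiguous groups, which is only feasible because the data set is tiny. All function names are hypothetical.

```python
from itertools import combinations

def group_ss(group):
    """Sum of squared distances from each mark to the group mean."""
    mean = sum(group) / len(group)
    return sum((m - mean) ** 2 for m in group)

def best_loss(marks, k):
    """Smallest total loss over all splits of the sorted marks into k contiguous groups."""
    data = sorted(marks)
    n = len(data)
    best = float("inf")
    # Choose k - 1 cut positions between consecutive marks
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        loss = sum(group_ss(data[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, loss)
    return best

marks = [3, 3, 4, 9, 9, 10]
losses = {k: best_loss(marks, k) for k in range(1, len(marks) + 1)}
for k, loss in losses.items():
    print(k, round(loss, 3))
```

The loss collapses when going from one group to two and barely changes afterwards, so the bend sits at k = 2, matching what the eye suggests for this distribution.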

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import warnings
warnings.filterwarnings('ignore')
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### The Dataset
The dataset consists of the classroom marks for a specific subject. Our objective is to model the data into sensible clusters for generating grades.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn.utils
from sklearn import metrics
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial.distance import cdist

# The CSV has no header row, so columns are named in the next cell
df = pd.read_csv('/kaggle/input/classroommarks/marks.csv', header = None)

In [None]:
df.columns = 'to_drop marks'.split()
df.drop('to_drop', axis = 1, inplace = True)
x = df.marks.values
plt.figure(figsize = (12,6))
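# histplot with a KDE overlay replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(x, bins = 50, kde = True, stat = 'density')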

*The data does not follow a normal distribution and is negatively skewed. This already exposes the loopholes in approaches that model marks as normally distributed.*

In [None]:
plt.figure(figsize = (12,6))
sns.boxplot(x = x)

*We observe outliers at the lower end of the spectrum*


# Data Preprocessing

In [None]:
def conv100(x):
    # Rescale marks linearly so the lowest becomes 0 and the highest becomes 100
    scaler = MinMaxScaler()
    x_fitted = scaler.fit_transform(x.reshape(-1,1))
    return x_fitted * 100

x_fitted = conv100(x)

# Removing the outliers
### Eps and minimum samples are chosen by educated guess to optimise outlier detection

In [None]:

# eps and min_samples are expressed on the 0-100 scale produced above
db = DBSCAN(eps = 4, min_samples = 4)
db.fit(x_fitted)
df = pd.DataFrame()
df['marks'] = x_fitted.ravel()
df['labels'] = db.labels_  # DBSCAN labels noise points as -1
df['ax'] = 1  # constant dummy axis so the marks can be drawn along a line
plt.figure(figsize = (25,10))
ax = sns.scatterplot(x = df.marks, y = df.ax, hue = df.labels)
plt.setp(ax.get_legend().get_texts(), fontsize = 22)
plt.setp(ax.get_legend().get_title(), fontsize = 22)


In [None]:
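# Drop the points DBSCAN marked as noise; copy so later column assignments act on an independent frame
df = df[df.labels != -1].copy()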

# Performing K means clustering on the sample

In [None]:
# Input the number of grades here
n_grades = 5


km = KMeans(
    n_clusters=n_grades, init='random',
    n_init=10, max_iter=300, 
    tol=1e-04, random_state=0
)
y_km = km.fit_predict(df['marks ax'.split()])
df['labels'] = y_km

plt.figure(figsize = (25,10))
sns.scatterplot(x= df.marks, y = df.ax, hue = df.labels)

In [None]:
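# Lowest mark in each cluster, sorted so that index 0 is the lowest grade
marks_range = df.groupby('labels').min().sort_values('marks').reset_index(drop = True)
plt.figure(figsize = (12,8))
# Recent seaborn releases require x and y to be passed as keywords
sns.barplot(x = marks_range.index, y = marks_range.marks)
print("These are the detected minimum values for each grade")
marks_range.head()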

# Generating optimum number of grades
While we modelled the data using a user-specified number of grades, the question remains whether the distribution is explained better by a lower or higher number of grades.  
We can use the elbow method on the distortion as well as the inertia. Inertia is the total sum of squared distances from each point to its cluster centre, while distortion, as computed here, is the average Euclidean distance from each point to its nearest cluster centre. We choose the number of clusters at which the loss starts decreasing in a roughly linear fashion.

In [None]:


distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)
X = df['marks ax'.split()]
for k in K:
    # Build and fit the model once per k; a fixed seed keeps the curves reproducible
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # Distortion: mean Euclidean distance from each point to its nearest centre
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1).mean()
    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_

In [None]:
plt.figure(figsize= (10,5))
plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

In [None]:
plt.figure(figsize= (10,5))
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()

### This shows that 3 or 4 clusters represent our distribution better
Feel free to share your views in the discussions; critical, constructive reviews and improvements are welcome. You can reach me at https://www.linkedin.com/in/jay-dhanwant <br>

Thanks a lot and have a better day than me!