
### Supervised Learning
### Activity: Building a Student Intervention System

### Question 1 - Classification vs. Regression
*Your goal for this project is to identify students who might need early intervention before they fail. Which type of supervised learning problem is this, classification or regression? Why?*

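**Answer:**

This is a classification problem, because the target takes one of two discrete values:
1. Students who might need early intervention (yes, 1)
2. Students who don't need early intervention (no, 0)

Regression would apply only if we were predicting a continuous quantity, such as a final grade.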

### Question-2
Load the necessary Python libraries and the student data. Note that the last column of this dataset, `'passed'`, will be our target label (whether the student graduated or didn't graduate). All other columns are features about each student.

In [2]:
# Import libraries

import numpy as np
import pandas as pd


In [3]:
# Read student data

data=pd.read_csv('student-data.csv')
data.head()
#data.shape

,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,passed
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,no,no,4,3,4,1,1,3,6,no
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,yes,no,5,3,3,1,1,3,4,no
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,yes,no,4,3,2,2,3,3,10,yes
3,GP,F,15,U,GT3,T,4,2,health,services,...,yes,yes,3,2,2,1,1,5,2,yes
4,GP,F,16,U,GT3,T,3,3,other,other,...,no,no,4,3,2,1,2,5,4,yes


In [4]:
data.shape

(395, 31)

### Question-3
Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, you will need to compute the following:
- The total number of students, `n_students`.
- The total number of features for each student, `n_features`.
- The number of those students who passed, `n_passed`.
- The number of those students who failed, `n_failed`.
- The graduation rate of the class, `grad_rate`, in percent (%).


In [5]:
# Calculate number of students

num_students=len(data)


In [6]:
# Calculate number of features

num_features=len(data.columns[:-1])


In [7]:
# Calculate passing students

num_passed=len(data[data.passed=='yes'])


In [8]:
# Calculate failing students

num_failed=len(data[data.passed=='no'])


In [9]:
# Calculate graduation rate

gradu_rate=num_passed/num_students*100


In [10]:
# Print the results
print('Total number of students:',num_students)
print('Number of features:',num_features)
print('Number of students who passed:',num_passed)
print('Number of students who failed:',num_failed)
print('Graduation rate:',gradu_rate)

Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate: 67.08860759493672
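The same counts can also be obtained in one pass with `value_counts()`. The snippet below is a minimal sketch on a hypothetical five-row stand-in (`toy`), not the real `student-data.csv`; the variable names follow the `n_passed` / `n_failed` / `grad_rate` naming from the question text.

```python
import pandas as pd

# Toy stand-in for the student data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "absences": [6, 4, 10, 2, 4],
    "passed": ["no", "no", "yes", "yes", "yes"],
})

# value_counts() tallies each label in one pass
counts = toy["passed"].value_counts()
n_passed = int(counts.get("yes", 0))
n_failed = int(counts.get("no", 0))

# Share of "yes" labels, expressed in percent
grad_rate = 100.0 * n_passed / len(toy)

print(f"Passed: {n_passed}, failed: {n_failed}, graduation rate: {grad_rate:.2f}%")
```

Using `.get(label, 0)` keeps the code from raising a `KeyError` if one of the two labels happens to be absent from the data.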


## Preparing the Data
In this section, you will prepare the data for modeling, training, and testing.

### Question-4 Identify feature and target columns


Separate the student data into feature and target columns to see if any features are non-numeric.

In [None]:
# Extract feature columns

#data.columns

In [11]:
feature_col = list(data.columns[:-1])

In [None]:
# Extract target column 'passed'

In [45]:
target_col=data.columns[-1]
#target_col=pd.DataFrame(data['passed'])
target_col

'passed'

In [None]:
# Separate the data into feature data and target data (X and y, respectively)

In [46]:
X=data[feature_col]
y=data[target_col]

print('Feature data:')
print(X.head())
print('\nTarget data:',target_col)

Feature data:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  higher internet  romantic  famrel  freetime goout Dalc Walc health absences  
0    yes       no        no       4         3     4    1    1      3        6  
1    yes      yes        no       5         3     3    1    1      3        4  
2    yes      yes        no       4         3     2    2    3      3       10  
3    yes      yes       yes       3         2     2    1    1      5        2  
4    yes       no        no       4         3     2    1    2      5        4  

[5 rows x 30 columns]
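Rather than scanning the printed table by eye, the non-numeric columns can be listed programmatically. This is a small sketch on a hypothetical four-column sample (`sample`), assuming that every column which is not a number still needs encoding.

```python
import pandas as pd

# Hypothetical mini-sample mixing numeric and text columns
sample = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "internet": ["no", "yes", "yes"],
    "absences": [6, 4, 10],
})

# Keep every column that is not numeric; these still need encoding
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)
```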

### Question-5 Preprocess Feature Columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation. Run the code cell below to perform the preprocessing routine discussed in this section.
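To make the two conversions concrete, here is a minimal sketch on a hypothetical three-row sample (`sample`): `internet` is mapped to `1`/`0`, and `Fjob` is expanded into one indicator column per value. Passing `dtype=int` is a choice made here so the indicators print as `1`/`0` regardless of the pandas version.

```python
import pandas as pd

# Hypothetical mini-sample with one binary and one multi-valued column
sample = pd.DataFrame({
    "internet": ["no", "yes", "yes"],
    "Fjob": ["teacher", "other", "services"],
})

# Binary yes/no column: map directly to 1/0
sample["internet"] = sample["internet"].map({"yes": 1, "no": 0})

# Multi-valued column: one indicator column per distinct value
encoded = pd.get_dummies(sample, columns=["Fjob"], dtype=int)

print(encoded)
```

Each row has exactly one `1` among the `Fjob_*` columns, which is what "assign a `1` to one of them and `0` to all others" means in practice.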

In [47]:
def preprocess_features(X):
    """Convert yes/no columns to 1/0 and other non-numeric columns to dummy variables."""

    # Initialize new output DataFrame
    output = pd.DataFrame(index=X.index)

    # Investigate each feature column for the data
    # (items() replaces iteritems(), which was removed in pandas 2.0)
    for col, col_data in X.items():

        if not pd.api.types.is_numeric_dtype(col_data):
            if set(col_data.unique()) <= {'yes', 'no'}:
                # Binary yes/no column: replace the values with 1/0
                col_data = col_data.map({'yes': 1, 'no': 0})
            else:
                # Categorical column: convert to dummy variables
                col_data = pd.get_dummies(col_data, prefix=col, dtype=int)

        output = output.join(col_data)

    return output

X = preprocess_features(X)
print('Processed feature columns:\n', X.head())

Processed feature columns:
    school_GP  school_MS  sex_F  sex_M  age  address_R  address_U  famsize_GT3  \
0          1          0      1      0   18          0          1            1   
1          1          0      1      0   17          0          1            1   
2          1          0      1      0   15          0          1            0   
3          1          0      1      0   15          0          1            1   
4          1          0      1      0   16          0          1            1   

   famsize_LE3  Pstatus_A  ...  higher  internet  romantic  famrel  freetime  \
0            0          1  ...       1         0         0       4         3   
1            0          0  ...       1         1         0       5         3   
2            1          0  ...       1         1         0       4         3   
3            0          0  ...       1         1         1       3         2   
4            0          0  ...       1         0         0       4         3   

   g

In [48]:
X.shape

(395, 48)

### Question - 6 Implementation: Training and Testing Data Split
So far, we have converted all _categorical_ features into numeric values. For the next step, we split the data (both features and corresponding labels) into training and test sets. You will need to implement the following:
- Randomly shuffle and split the data (`X`, `y`) into training and testing subsets.
  - Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
  - Set a `random_state` for the function(s) you use, if provided.
  - Store the results in `X_train`, `X_test`, `y_train`, and `y_test`.

In [None]:
# splitting the data into train and test


In [19]:
from sklearn.model_selection import train_test_split

#number of training points
num_train=300

#number of testing points
num_test=X.shape[0]-num_train

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=num_test,random_state=2)


In [20]:
# Show the results of the split

print('Number of training set samples:', X_train.shape[0])
print('Number of testing set samples:', X_test.shape[0])

Number of training set samples: 300
Number of testing set samples: 95
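Because about two thirds of the students passed, a purely random split can leave the training and test sets with slightly different pass rates. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`) with the same 265/130 label counts; it is an illustration under that assumption, not a rerun on the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 395 rows with the same 265/130 label counts as the dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(395, 5))
y_demo = np.array(["yes"] * 265 + ["no"] * 130)

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=2, stratify=y_demo
)

train_rate = np.mean(y_tr == "yes")
test_rate = np.mean(y_te == "yes")
print("Train pass rate:", round(train_rate, 3))
print("Test pass rate: ", round(test_rate, 3))
```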


### Question - 7  Training and Evaluating Models
In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data and measure the accuracy score.
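Fitting a model to varying sizes of training data could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the preprocessed student features, and the training sizes 100, 200, and 300 are an assumption chosen to match the 300-point training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed student features
X_demo, y_demo = make_classification(n_samples=395, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=95, random_state=2
)

scores = {}
for n in (100, 200, 300):
    # Train on the first n training points only; always score on the same test set
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr[:n], y_tr[:n])
    scores[n] = accuracy_score(y_te, model.predict(X_te))

for n, acc in scores.items():
    print(f"train size {n}: test accuracy {acc:.3f}")
```

Keeping the test set fixed while the training subset grows makes the accuracy values comparable across sizes.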

Explanation

1. Support Vector Machine

- General applications:
  - Support Vector Machines can be used for both classification and regression problems.
  - Classification: text categorization or image classification.
  - Regression: sales forecasting when running promotions.
- Strengths:
  - Works well in high-dimensional spaces, even when there are more dimensions than samples.
  - Memory efficient, because the decision function uses only a subset of the training points (the support vectors).
- Weaknesses:
  - When the number of features is much greater than the number of samples, the kernel and regularization term must be chosen carefully to avoid overfitting.
  - High computational cost: training time grows roughly quadratically to cubically with the number of samples.
- Why this model: it can be slow to train, but that should not matter here because we probably want to run the model periodically rather than in real time.
 
2. Random Forest Classifier

- Strengths and weaknesses:
  - The model is easy to use.
  - It handles categorical variables easily and does not assume that features are linear or interact linearly.
  - It also handles high-dimensional spaces and large numbers of training examples very well.
  - It is less likely to overfit than a single decision tree. However, it is more difficult to interpret than a decision tree.
- Why this model: it is easy to use and handles categorical variables very well.


3. Logistic Regression

- Industry usage:
  - Identifying and automatically categorizing protein sequences into one of 11 pre-defined classes.
  - There is tremendous potential for further bioinformatics applications using Logistic Regression.
- Strengths:
  - Many ways to regularize the model to tolerate some errors and avoid overfitting.
  - Unlike Support Vector Machines, it can easily take in new data using an online gradient descent method.
- Weaknesses:
  - Requires observations to be independent of one another.
  - It predicts from the independent variables supplied; if they are not properly identified, Logistic Regression provides little predictive value.


In [27]:
# Import the three supervised learning models from sklearn

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report



In [None]:
# fit model-1 on training data

In [28]:
svm=SVC()
svm.fit(X_train,y_train)


SVC()

In [None]:
# predict on the test data 

In [29]:
y_pred=svm.predict(X_test)

In [None]:
# calculate the accuracy score

In [30]:
# accuracy_score expects the true labels first, then the predictions
print('Accuracy is:',accuracy_score(y_test,y_pred))

Accuracy is: 0.6842105263157895


In [None]:
# fit the model-2 on training data and predict on the test data and measure the accuracy

In [41]:
#Create a Random Forest classifier
clf=RandomForestClassifier()
clf.fit(X_train,y_train)
#predict on the test data
y_pred=clf.predict(X_test)
#calculate the accuracy score (accuracy_score is imported directly above)
print("Accuracy:",accuracy_score(y_test, y_pred))

Accuracy: 0.7157894736842105


In [None]:
# fit the model-3 on training data and predict on the test data and measure the accuracy

In [42]:
lr=LogisticRegression()
lr.fit(X_train,y_train)

#predict on the test data
y_pred=lr.predict(X_test)

#calculate the accuracy score
print('Accuracy:',accuracy_score(y_test,y_pred))

Accuracy: 0.7368421052631579


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
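The warning above says the solver stopped before converging and suggests scaling the data or raising `max_iter`. A minimal sketch of both suggestions together is shown below, using synthetic stand-in data (`X_demo`, `y_demo`) rather than the real `X_train`; the value `max_iter=1000` is an assumption, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Scale features to zero mean and unit variance, then fit with a larger iteration budget
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
n_iter = int(pipe.named_steps["logisticregression"].n_iter_[0])
print("Solver iterations used:", n_iter)
```

Wrapping the scaler and the classifier in one pipeline means the scaling learned on the training data is reused unchanged when predicting on test data.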


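Inference:

Logistic Regression achieved the highest test accuracy (0.737), followed by Random Forest (0.716) and SVM (0.684). On the 95-point test set these scores correspond to 70, 68, and 65 correct predictions, so the differences are small, and the Random Forest result can vary between runs because no `random_state` was set.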
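Since a single 95-point test set gives a noisy ranking, one way to compare the three models more robustly is k-fold cross-validation. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo`), so its numbers say nothing about the student dataset; the choice of 5 folds and the fixed `random_state` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and labels
X_demo, y_demo = make_classification(n_samples=395, n_features=10, random_state=0)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation: each model is scored on five different held-out folds
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```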