In [2]:
import pandas as panda
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score

## Practice & Exercise: Naive Bayes Implementation

Naive Bayes is a simple but effective algorithm for classification. It estimates the probability of each value of the target variable given the observed feature values. Training and prediction are fast, and it can give reasonable results even on small datasets, so it is often used to make predictions on real-time data. Here we use the adult dataset and a Naive Bayes classifier to predict the income bracket, then measure the accuracy of those predictions.
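
For reference, the rule behind the classifier can be written out. For a class $y$ and features $x_1, \dots, x_n$, Bayes' theorem combined with the "naive" assumption that the features are conditionally independent given the class gives:

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the $y$ that maximizes the right-hand side. In this notebook $y$ is the income bracket (`<=50K` or `>50K`) and the $x_i$ are the feature columns.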

##### pandas is used to load the dataset from a file into a data frame. train_test_split is used to split the dataset into training and testing data. accuracy_score is used to measure how closely the predictions match the target field.

## Importing Dataset

In [3]:
dataset = panda.read_csv('adult.csv')
dataset.head() 

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
# Drop several columns to keep the dataset and the model simple
dataset.drop(['fnlwgt','workclass','education','educational-num','marital-status','occupation','relationship','race','capital-gain','capital-loss','native-country'],axis='columns',inplace=True)
dataset.head()

Unnamed: 0,age,gender,hours-per-week,income
0,25,Male,40,<=50K
1,38,Male,50,<=50K
2,28,Male,40,>50K
3,44,Male,40,>50K
4,18,Female,30,<=50K


## Defining features and output

In [5]:
input = dataset.drop('income', axis=1)      
target = dataset['income'] # target denotes target which is adult income  

###### input holds the feature columns, i.e. everything except "income" (at this point "age", "gender" and "hours-per-week"), which is why the "income" column is dropped here

In [6]:
len(input)

48842

##### Machine learning models cannot handle text columns directly, so they have to be converted to numbers. Here we generate dummy (indicator) columns for gender with get_dummies

In [7]:
dummies = panda.get_dummies(input.gender) 
dummies.head(3)

Unnamed: 0,Female,Male
0,0,1
1,0,1
2,0,1


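The gender column is dropped from input here. Note that the dummy columns generated above are not joined back into input, so the model below is trained on age and hours-per-week only. For more details on the dummy variable trap, see:
https://medium.com/datadriveninvestor/dummy-variable-trap-c6d4a387f10a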
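
If the gender information should be kept as a feature, one possible approach is sketched below. It is not what this notebook does: the toy data frame `df` and the name `encoded` are made up for illustration, and `drop_first=True` is one common way to avoid the dummy variable trap by keeping a single indicator column.

```python
import pandas as pd

# Toy rows mirroring the first few records shown above (illustrative only).
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "gender": ["Male", "Male", "Male", "Male", "Female"],
    "hours-per-week": [40, 50, 40, 40, 30],
})

# Encode gender as 0/1 and drop the first category ("Female"), so only one
# indicator column ("gender_Male") remains next to the numeric columns.
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["gender_Male"].tolist())
```

With a single indicator column, a value of 0 already implies "Female", so a second column would carry no extra information.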

In [8]:
input.drop(['gender'],axis='columns',inplace=True) 
input.head()

Unnamed: 0,age,hours-per-week
0,25,40
1,38,50
2,28,40
3,44,40
4,18,30


## Splitting the dataset into training and testing data


##### The script below splits the dataset into 80% training data and 20% test data. The model is trained on the training data and then evaluated on the test data.

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(input,target, test_size=0.20)

In [10]:
len(X_train)

39073

In [11]:
len(X_test)

9769
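
The split above is random, so the rows in each part (and the accuracy reported later) can change between runs. A minimal sketch of a reproducible split follows, using hypothetical toy data; `random_state=42` is an arbitrary seed and `stratify` is an optional extra that keeps the class ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 75 labelled '<=50K' and 25 labelled '>50K'.
features = pd.DataFrame({"age": range(100), "hours-per-week": [40] * 100})
labels = pd.Series(["<=50K"] * 75 + [">50K"] * 25, name="income")

# random_state fixes the shuffle so reruns give the same split;
# stratify keeps the 75/25 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.20, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```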

## Training and prediction

In [12]:
from sklearn.naive_bayes import MultinomialNB  # Multinomial Naive Bayes classifier from scikit-learn
model = MultinomialNB().fit(X_train, Y_train)  # Train the model on the training data
Y_pred = model.predict(X_test)  # Predict labels for the test data; they are compared with the actual target below

##### MultinomialNB is the classifier we use to train our model. Other Naive Bayes classifiers in scikit-learn include GaussianNB and BernoulliNB. https://www.reddit.com/r/MachineLearning/comments/2uhhbh/difference_between_binomial_multinomial_and/
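
To see how the choice of variant can matter, the sketch below fits MultinomialNB and GaussianNB on the same data. The data is synthetic and the labelling rule is invented purely so there is something to learn, so the scores say nothing about the adult dataset; the point is only that both classifiers share the same `fit`/`score` interface and can be swapped easily.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Synthetic stand-in for two non-negative numeric features (not the adult data).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 80, size=500),  # age-like column
    rng.integers(1, 99, size=500),   # hours-per-week-like column
])
# Hypothetical labelling rule, used only for illustration.
y = np.where(X[:, 0] + X[:, 1] > 95, ">50K", "<=50K")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# MultinomialNB treats each feature as a count; GaussianNB models each
# feature as normally distributed within a class.
scores = {}
for clf in (MultinomialNB(), GaussianNB()):
    scores[type(clf).__name__] = clf.fit(X_tr, y_tr).score(X_te, y_te)

print(scores)
```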

In [13]:
print(Y_pred)

['<=50K' '<=50K' '<=50K' ... '<=50K' '<=50K' '<=50K']


## Evaluating the Algorithm

In [14]:
print(accuracy_score(Y_test, Y_pred))  # accuracy_score expects (y_true, y_pred)

0.7514586958747057


##### Accuracy of the Naive Bayes algorithm on the adult dataset is about 75.15% for this train/test split
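
An accuracy figure is easier to interpret next to a baseline. The sketch below uses made-up labels (not the notebook's results) to show that a "model" which always predicts the majority class can still score well on imbalanced data, and that a confusion matrix makes this visible.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 7 of 10 people earn '<=50K'.
y_true = ["<=50K"] * 7 + [">50K"] * 3
# A "model" that always predicts the majority class.
y_pred = ["<=50K"] * 10

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"])

print(acc)  # 0.7
print(cm)   # rows are true classes, columns are predicted classes
```

Here the second column of the matrix is all zeros: no row is ever predicted as `>50K`, even though accuracy is 70%. Running the same two calls on `Y_test` and `Y_pred` would show whether the trained model behaves similarly.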

In [17]:
Y_test[:10]

18394    <=50K
15332     >50K
25186    <=50K
17051    <=50K
26354    <=50K
24141    <=50K
18291    <=50K
14516    <=50K
5615     <=50K
28935    <=50K
Name: income, dtype: object

In [18]:
model.predict(X_test[:10])


array(['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K',
       '<=50K', '<=50K', '<=50K'], dtype='<U5')

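###### Comparing the first ten test labels with the predictions, the model predicts '<=50K' for all ten rows, so the one '>50K' row (index 15332) is misclassified. Overall accuracy is about 75.15%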
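
The two outputs above can also be lined up in one table. This is a small sketch with the ten values copied from those outputs; the names `comparison` and `correct` are chosen here for illustration.

```python
import pandas as pd

# Values copied from the two outputs above (first ten test rows).
actual = pd.Series(
    ["<=50K", ">50K"] + ["<=50K"] * 8,
    index=[18394, 15332, 25186, 17051, 26354, 24141, 18291, 14516, 5615, 28935],
    name="actual",
)
predicted = ["<=50K"] * 10

# Put actual and predicted labels side by side and flag the matches.
comparison = actual.to_frame()
comparison["predicted"] = predicted
comparison["correct"] = comparison["actual"] == comparison["predicted"]

print(comparison)
print(comparison["correct"].sum(), "of", len(comparison), "correct")
```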