## Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

## Read File

In [2]:
df = pd.read_csv('../Tugas Region/data survey (1) (1).csv')
df.head(1)

Unnamed: 0,id,name,age,age_range,province,city,gender,interest,interest_detail
0,id_4100001227501134,Sultan,,,,,,Fotografi,


In [3]:
df.shape

(425332, 9)

Check for rows that have no gender label.
<br>
These rows will later be used as the prediction set for the trained model.

## Preprocessing dataset
Rows that have a gender label are used as train and test data; rows without one will be predicted once the model has been trained.
<br>
For training, we use only the <b>name</b> column as the feature and the <b>gender</b> column as the label.

In [4]:
df.gender.isna().sum()

50881

In [5]:
df_clean = df[['name', 'gender']][df.gender.isin(['male', 'female'])]
df_clean.shape

(374451, 2)

Many rows are duplicates (the same name and gender pair), so we remove them with drop_duplicates. This makes the model training process more efficient.

In [6]:
df_clean.drop_duplicates(ignore_index=True, inplace=True)
df_clean.shape

(137973, 2)

In [7]:
df_clean.gender.value_counts()

male      71333
female    66640
Name: gender, dtype: int64

**Label Encoder**
<br><br>
Convert the male and female categories in the gender column to 1 and 0, respectively.

In [8]:
le = LabelEncoder()
df_clean.gender = le.fit_transform(df_clean.gender)

In [9]:
df_clean.gender.unique()

array([1, 0])

Description
1. Male is <b>1</b>
2. Female is <b>0</b>
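
The mapping does not have to be read off the output by eye; it can be checked on the fitted encoder. Here is a minimal sketch on a toy Series (the toy data and the names `toy_gender`, `le_demo`, and `mapping` are assumptions for illustration; `classes_` and `inverse_transform` are standard `LabelEncoder` members):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df_clean.gender (illustrative only).
toy_gender = pd.Series(["male", "female", "female", "male"])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(toy_gender)

# classes_ is sorted, so the position of each class is its encoded value.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'female': 0, 'male': 1}

# inverse_transform turns the integer codes back into the original labels.
decoded = le_demo.inverse_transform(encoded)
```

Because the classes are sorted alphabetically, "female" comes first and receives 0, which matches the description above.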

**Feature Extraction**
<br><br>
Next we perform feature extraction on the name column, that is, we convert the text into numerical vectors so that the model can learn from it.
<br>
Here we use CountVectorizer from sklearn, which counts how often each word (token) occurs in every name.
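
To make the idea concrete, here is a small sketch of what CountVectorizer produces for three made-up names (the names and the variables `toy_names`, `cv_demo`, `X_demo`, and `vocab` are assumptions for illustration, not part of the survey data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy names (illustrative only, not taken from the survey data).
toy_names = ["budi santoso", "siti aminah", "budi hartono"]

cv_demo = CountVectorizer()
X_demo = cv_demo.fit_transform(toy_names)  # sparse matrix, one row per name

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically).
vocab = {token: int(idx) for token, idx in sorted(cv_demo.vocabulary_.items(), key=lambda kv: kv[1])}
print(vocab)
print(X_demo.toarray())  # each row counts the tokens that appear in that name
```

Each column corresponds to one token and each row to one name, so "budi santoso" has a 1 in the "budi" and "santoso" columns and 0 elsewhere. Note that with the default settings, tokens of a single character are ignored.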

In [10]:
cv = CountVectorizer()
X = cv.fit_transform(df_clean.name.values.astype('U'))

In [11]:
X = cv.fit_transform(df_clean.name.apply(lambda x: np.str_(x)))

Calling <b>values.astype('U')</b> consumes more memory when the Series being converted is very large. <br><br>Alternatively, a lambda expression can convert each element of the Series from <b>str</b> to <b>numpy.str_</b>; the result is also accepted by the fit_transform function,<br> and according to the source below it is faster and does not increase memory usage. Both cells produce the same feature matrix.

Source : https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document
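
As a rough illustration of what the two conversions do to a missing name, here is a sketch on a toy Series (the data and variable names are assumptions for illustration). It does not measure memory or speed; it only shows that both approaches yield the same strings:

```python
import numpy as np
import pandas as pd

# Toy Series with one missing name (illustrative only).
toy = pd.Series(["sultan", np.nan, "yasmine"])

via_astype = toy.values.astype("U")          # converts the whole array at once
via_apply = toy.apply(lambda x: np.str_(x))  # converts element by element

# Compare the results as plain Python strings.
astype_strings = [str(s) for s in via_astype]
apply_strings = [str(s) for s in via_apply]
print(astype_strings == apply_strings)
print(astype_strings)
```

In both cases the missing value becomes the literal string "nan", so CountVectorizer treats it as an ordinary token instead of raising an error.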

In [12]:
cv.get_feature_names()  # removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out()

['aa',
 'aab',
 'aabaronk',
 'aabbee',
 'aabit',
 'aabryella',
 'aad',
 'aadc',
 'aadeon',
 'aadi',
 'aadiin',
 'aadnaa',
 'aadp',
 'aae',
 'aaee',
 'aafito',
 'aag',
 'aagungg',
 'aah',
 'aaij',
 'aaisy',
 'aajahh',
 'aajha',
 'aajo',
 'aak',
 'aakechill',
 'aakustik',
 'aal',
 'aaliciaa',
 'aalim',
 'aam',
 'aamal',
 'aamalo',
 'aambarwati',
 'aamd',
 'aamelia',
 'aamet',
 'aamiin',
 'aan',
 'aanch',
 'aandesttyantoro',
 'aandora',
 'aang',
 'aangz',
 'aani',
 'aaniiytaa',
 'aank',
 'aanoy',
 'aanthevaline',
 'aantrie',
 'aanurjaman',
 'aap',
 'aar',
 'aara',
 'aarafi',
 'aarif',
 'aaron',
 'aarrggaa',
 'aarsakha',
 'aartsen',
 'aas',
 'aasaa',
 'aasfortrccing',
 'aashiqui',
 'aashro',
 'aashuikk',
 'aaswar',
 'aasyifah',
 'aat',
 'aathe',
 'aatinaa',
 'aaucup',
 'aaw',
 'aawawan',
 'aaxz',
 'aay',
 'aayu',
 'aayuu',
 'aaz',
 'aazang',
 'aazar',
 'aazelfatih',
 'aazhaen',
 'aaziiez',
 'aazrie',
 'ab',
 'aba',
 'abach',
 'abad',
 'abadi',
 'abadii',
 'abadinanjaya',
 'abady',
 'abaeh'

Split the data into train and test sets
<br><br>
1. <b>X</b> = Feature
2. <b>y</b> = Labels

In [13]:
y = df_clean.gender

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
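
The effect of `test_size=0.33` can be sketched on a small toy array. The data below is made up, and the `stratify` argument is an addition that the notebook does not use; it is shown only as an optional way to keep the class ratio the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features, balanced binary labels (illustrative only).
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.array([0, 1] * 50)

# stratify=y_toy is an assumption added here, not part of the original call.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)  # (67, 3) (33, 3)
```

With 100 samples, 33 go to the test set and the remaining 67 to the training set. Fixing `random_state` makes the split reproducible between runs.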

## Naive Bayes Classifier

In [15]:
MNB = MultinomialNB()
MNB.fit(X_train, y_train)
MNB.score(X_test, y_test)

0.8298339629271722

The accuracy of our model is <b>82.98</b>%

Sample Prediction

In [16]:
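def gender_predictor(name):
    # Vectorize a single name with the fitted CountVectorizer (sparse output is fine)
    vector = cv.transform([name])
    # predict returns a one-element array; 0 = female, 1 = male
    if MNB.predict(vector)[0] == 0:
        return "Female"
    else:
        return "Male"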

In [17]:
sample_name = ['rafka ramadhan', 'yasmine nabilla', 'tomy hardiansyah', 'gigi']

In [18]:
df_predict = df['name'][df.gender.isna()].drop_duplicates().tolist()
len(df_predict)

18754

In [19]:
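sample_names = df_predict[:20]  # first 20 unlabeled names (not used in the table below)
# Predict on the hand-picked names in sample_name from In [17]
gender_predict = [gender_predictor(name) for name in sample_name]
pd.DataFrame({'name': sample_name, 'prediction': gender_predict})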

Unnamed: 0,name,prediction
0,rafka ramadhan,Male
1,yasmine nabilla,Female
2,tomy hardiansyah,Male
3,gigi,Female
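
Predicting one name at a time is fine for a handful of samples, but for the full list of unlabeled names it may be simpler to vectorize and predict in a single call. Here is a minimal sketch, assuming a toy training set; `predict_genders`, `cv_demo`, and `mnb_demo` are hypothetical names that stand in for the notebook's fitted `cv` and `MNB`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set (illustrative only); 1 = male, 0 = female as in the notebook.
train_names = ["budi santoso", "agus budi", "siti aminah", "dewi siti"]
train_labels = [1, 1, 0, 0]

cv_demo = CountVectorizer()
mnb_demo = MultinomialNB().fit(cv_demo.fit_transform(train_names), train_labels)

def predict_genders(names):
    """Hypothetical helper: label a whole list of names in one call."""
    vectors = cv_demo.transform(names)  # stays sparse, no .toarray() needed
    codes = mnb_demo.predict(vectors)   # one integer code per name
    return ["Male" if code == 1 else "Female" for code in codes]

predictions = predict_genders(["budi", "siti"])
print(predictions)  # ['Male', 'Female']
```

Applied to the notebook's objects, the same pattern would take `df_predict` as input and return one label per name.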


## Random Forest Classifier

In [20]:
RF = RandomForestClassifier(max_depth=2, random_state=0)
RF.fit(X_train, y_train)
RF.score(X_test, y_test)

0.5162962312219977

The accuracy of our model is <b>51.63</b>%

## K-Nearest Neighbors

In [21]:
KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)

KNeighborsClassifier()

In [22]:
KNN.score(X_test, y_test)

0.6860889045067206

n_neighbors = 2, accuracy : 0.6135465167354828<br>
n_neighbors = 1, accuracy : 0.6795879820785382<br>
n_neighbors = 5, accuracy : 0.6860889045067206
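
A comparison like the one above can be produced with a short loop over `n_neighbors`. The sketch below uses synthetic data from `make_classification` as a stand-in for the vectorized names, so its accuracies will not match the figures reported here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized names (illustrative only).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

scores = {}
for k in (1, 2, 5):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)  # mean accuracy on the held-out split

for k, acc in scores.items():
    print(f"n_neighbors = {k}, accuracy : {acc:.4f}")
```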

## Summary

From the comparison above:
1. The Naive Bayes classifier achieves the highest accuracy of the three algorithms, about 83.0%. <br>
Its training and testing were also the fastest of the three.
2. Random Forest has the lowest accuracy (about 51.6% with max_depth=2), but its training and testing are faster than KNN's.
3. KNN has the second-highest accuracy after Naive Bayes (about 68.6% with n_neighbors = 5), but it is computationally expensive.<br>
KNN takes a long time during testing, and the higher the value of n_neighbors, the longer the computation. <br>
This makes sense: to classify each test sample, KNN has to compare it against the training samples to find its nearest neighbors, for example the 5 closest when n_neighbors = 5.
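
The speed observations can be checked with a simple timing harness. This is a sketch only: `time_model` is a hypothetical helper, and the random count data is a stand-in, so the accuracies it prints are meaningless and the timings will differ from those on the real name matrix:

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Random non-negative counts and labels (illustrative only).
rng = np.random.default_rng(0)
X_syn = rng.integers(0, 3, size=(500, 20))
y_syn = rng.integers(0, 2, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

def time_model(model):
    """Hypothetical helper: return (fit_seconds, score_seconds, accuracy)."""
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    t1 = time.perf_counter()
    acc = model.score(X_te, y_te)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, acc

models = {
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(max_depth=2, random_state=0),
    "KNN": KNeighborsClassifier(),
}
timings = {name: time_model(model) for name, model in models.items()}
for name, (fit_s, score_s, acc) in timings.items():
    print(f"{name}: fit {fit_s:.4f}s, score {score_s:.4f}s, accuracy {acc:.3f}")
```

Replacing the synthetic arrays with the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would time the three models on the actual data.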

Thank you