# Project Code Analysis

## Data Cleaning Function Breakdown

In [1]:
import pandas as pd

In [2]:
df_towns = pd.read_excel('./data/MMNames_mimu.xlsx', sheet_name='Towns')
df_villages = pd.read_excel('./data/MMNames_mimu.xlsx', sheet_name='Villages')
df_news = pd.read_csv('./data/MMNames_news.csv')

In [3]:
print(df_towns.info())
print(df_villages.info())
print(df_news.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536 entries, 0 to 535
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   SR_Name_Eng    536 non-null    object
 1   Town_Name_Eng  536 non-null    object
dtypes: object(2)
memory usage: 8.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14047 entries, 0 to 14046
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   SR_Name_Eng             14047 non-null  object
 1   Village_Tract_Name_Eng  14047 non-null  object
dtypes: object(2)
memory usage: 219.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58789 entries, 0 to 58788
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   admin1    58789 non-null  object
 1   location  58789 non-null  object
dtypes: object(2)
memory usage: 918.7+ KB
None


In [4]:
df_towns.head()

Unnamed: 0,SR_Name_Eng,Town_Name_Eng
0,Ayeyarwady,Bogale Town
1,Ayeyarwady,Danubyu Town
2,Ayeyarwady,Dedaye Town
3,Ayeyarwady,Einme Town
4,Ayeyarwady,Du Yar Town


In [5]:
df_villages.head()

Unnamed: 0,SR_Name_Eng,Village_Tract_Name_Eng
0,Ayeyarwady,(Kyun Nyo Gyi) Kyun Hteik
1,Ayeyarwady,Auk Hle Seik
2,Ayeyarwady,Aye
3,Ayeyarwady,Aye Yar
4,Ayeyarwady,Boe Di Kwe


In [6]:
df_news.head()

Unnamed: 0,admin1,location
0,Kachin,Hseng Taung (Ywar Haung)
1,Kachin,Hseng Taung (Ywar Haung)
2,Shan-South,Lel Kyar
3,Shan-South,Lel Kyar
4,Sagaing,Nyaung Pin Kan


In [7]:
import data_cleaning as dc
dc.main()

The data_cleaning (dc) module loads the data from the MMNames_mimu.xlsx Excel sheets and from MMNames_news.csv, then runs it through the following data cleaning process.

1. It changes the towns, villages, and news columns to have consistent names

In [8]:
df_towns.columns = ['SR_Name', 'name']

In [9]:
df_towns.head()

Unnamed: 0,SR_Name,name
0,Ayeyarwady,Bogale Town
1,Ayeyarwady,Danubyu Town
2,Ayeyarwady,Dedaye Town
3,Ayeyarwady,Einme Town
4,Ayeyarwady,Du Yar Town


In [10]:
df_villages.columns = ['SR_Name', 'name']

In [11]:
df_villages.head()

Unnamed: 0,SR_Name,name
0,Ayeyarwady,(Kyun Nyo Gyi) Kyun Hteik
1,Ayeyarwady,Auk Hle Seik
2,Ayeyarwady,Aye
3,Ayeyarwady,Aye Yar
4,Ayeyarwady,Boe Di Kwe


In [12]:
df_news.columns = ['SR_Name', 'name']

In [13]:
df_news.head()

Unnamed: 0,SR_Name,name
0,Kachin,Hseng Taung (Ywar Haung)
1,Kachin,Hseng Taung (Ywar Haung)
2,Shan-South,Lel Kyar
3,Shan-South,Lel Kyar
4,Sagaing,Nyaung Pin Kan


2. It then removes the word 'Town' from each row of the 'name' column

In [14]:
df_towns['name'] = df_towns['name'].apply(lambda name: name.replace('Town', '').strip())
df_towns.head()

Unnamed: 0,SR_Name,name
0,Ayeyarwady,Bogale
1,Ayeyarwady,Danubyu
2,Ayeyarwady,Dedaye
3,Ayeyarwady,Einme
4,Ayeyarwady,Du Yar
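
One caveat with `name.replace('Town', '')` is that it removes the substring wherever it appears in the name, not only as a trailing word. The following is a minimal sketch of a stricter alternative, assuming that only a trailing 'Town' should be dropped. The helper `strip_town_suffix` and the name 'Townsend Town' are hypothetical and not part of data_cleaning.

```python
import re

# Match the standalone word "Town" only at the very end of the name.
_TOWN_SUFFIX = re.compile(r'\s+Town\s*$')

def strip_town_suffix(name: str) -> str:
    """Drop a trailing 'Town' word and leave other occurrences intact."""
    return _TOWN_SUFFIX.sub('', name).strip()

print(strip_town_suffix('Bogale Town'))
print(strip_town_suffix('Du Yar Town'))
# Hypothetical name: a plain replace('Town', '') would also cut into "Townsend".
print(strip_town_suffix('Townsend Town'))
```

It could be applied in the same way as the lambda above, for example with `df_towns['name'].apply(strip_town_suffix)`.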


3. It then combines the two DataFrames into a single DataFrame

In [15]:
df_mimu = pd.concat([df_towns, df_villages], ignore_index=True)

In [16]:
df_mimu.head()

Unnamed: 0,SR_Name,name
0,Ayeyarwady,Bogale
1,Ayeyarwady,Danubyu
2,Ayeyarwady,Dedaye
3,Ayeyarwady,Einme
4,Ayeyarwady,Du Yar


In [17]:
print(df_towns.shape, df_villages.shape, df_mimu.shape)

(536, 2) (14047, 2) (14583, 2)


4. It removes missing values from the combined DataFrame. For some reason, duplicates are not dropped from the combined DataFrame, but they are removed from df_news

In [18]:
df_mimu.isna().sum()

SR_Name    0
name       0
dtype: int64

In [19]:
df_mimu.duplicated().sum()

1573

In [20]:
df_mimu = df_mimu.dropna()

In [21]:
duplicated_names = df_mimu['name'][df_mimu['name'].duplicated()].unique()
print(duplicated_names)

['Hainggyikyun' 'Myo Hla' 'Minhla' ... 'Pa Thi' 'Ta Mar Ta Kaw' 'Tet Thit']


In [22]:
counts = df_mimu['name'].value_counts()
duplicates = counts[counts > 1]
print(duplicates)

name
Nyaung Pin Thar    27
Thar Yar Kone      22
Ywar Thit          21
Nyaung Kone        20
Ma Gyi Kone        18
                   ..
Zee Kaing           2
Da Naw              2
Ta Mar Kone         2
Pi Tauk Pin         2
Ku Lar Gyi Su       2
Name: count, Length: 1547, dtype: int64
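
The two checks above count different things: `df_mimu.duplicated()` flags rows where both SR_Name and name repeat, while `df_mimu['name'].duplicated()` flags a name that repeats even in a different region. A small sketch on made-up rows (the toy DataFrame is an assumption, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Made-up rows: the same village name appears twice in one region
# and once more in a different region.
toy = pd.DataFrame({
    'SR_Name': ['Sagaing', 'Sagaing', 'Magway', 'Yangon'],
    'name': ['Ywar Thit', 'Ywar Thit', 'Ywar Thit', 'Insein'],
})

row_dupes = int(toy.duplicated().sum())           # full-row repeats
name_dupes = int(toy['name'].duplicated().sum())  # repeats of the name alone
deduped = toy.drop_duplicates()                   # drops only full-row repeats

print(row_dupes, name_dupes, len(deduped))
```

Because the same name can legitimately exist in several regions, dropping on the name alone would discard valid (region, name) pairs.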


In [23]:
df_news.shape

(58789, 2)

In [24]:
print(df_news.shape, df_news.value_counts())

(58789, 2) SR_Name      name          
Sagaing      Monywa            1046
             Kale              1005
             Yinmarbin          785
Mandalay     Pyigyitagon        672
             Mandalay           524
                               ... 
Rakhine      Sa Bai Kone          1
Kachin       Hseng Awng           1
Tanintharyi  Hpaung Taw Gyi       1
Kachin       Htan Ta Pin          1
Rakhine      Aung Ba La           1
Name: count, Length: 4930, dtype: int64


In [25]:
df_news = df_news.drop_duplicates()

In [26]:
print(df_news.shape, df_news.value_counts())

(4930, 2) SR_Name     name         
Ayeyarwady  Ah Choke Gyi     1
Sagaing     Ma Gyi Tan       1
            Ma Le Thar       1
            Ma Khauk         1
            Ma Hti Thar      1
                            ..
Magway      Tet Ma           1
            Tei Pin Kan      1
            Te Kone          1
            Te Gyi           1
Yangon      Zee Hpyu Kone    1
Name: count, Length: 4930, dtype: int64


5. Unique names are extracted from the SR_Name column of the combined DataFrame. fuzzywuzzy (in my case updated to the thefuzz library) is then used to find the best-matching name for each SR_Name in df_news, with score_cutoff=90 so that only matches scoring 90 or above (out of 100) are considered

In [27]:
mimu_SR_Name = df_mimu['SR_Name'].unique()

In [28]:
print(mimu_SR_Name)

['Ayeyarwady' 'Bago (East)' 'Bago (West)' 'Chin' 'Kachin' 'Kayah' 'Kayin'
 'Magway' 'Mandalay' 'Mon' 'Nay Pyi Taw' 'Rakhine' 'Sagaing' 'Shan (East)'
 'Shan (North)' 'Shan (South)' 'Tanintharyi' 'Yangon']


In [29]:
from thefuzz import process

In [30]:
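def find_closest_match(row, mimu_SR_Name):
    # extractOne returns None when no candidate reaches score_cutoff,
    # so fall back to the original value instead of indexing None.
    match = process.extractOne(row, mimu_SR_Name, score_cutoff=90)
    return match[0] if match is not None else row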

In [31]:
df_news.loc[:, 'SR_Name']  = df_news['SR_Name'].apply(lambda x: find_closest_match(x, mimu_SR_Name))

In [32]:
df_news.head()

Unnamed: 0,SR_Name,name
0,Kachin,Hseng Taung (Ywar Haung)
2,Shan (South),Lel Kyar
4,Sagaing,Nyaung Pin Kan
6,Magway,Ywar Taw
10,Chin,Tlanglo


In [33]:
df_mimu = pd.concat([df_mimu, df_news], ignore_index=True)

In [34]:
df_mimu.shape

(19513, 2)

This is how we get the final, cleaned MMNames_clean.csv file.

## Data Preprocessing Function Breakdown

nltk is the key dependency of this module, together with its punkt, stopwords, and wordnet resources.\
The clean_text function is not used in this project.

preprocess_category and preprocess_onehot are the only functions used in the initial phase.\
preprocess_category:

```df[column] = df[column].cat.codes```

This creates a label encoding for the categorical column. Note that the integer codes introduce an artificial ordering among the categories.

```df = pd.get_dummies(df, columns=[column])```

This creates a one-hot encoding for the column.
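
To make the difference between the two encodings concrete, here is a small sketch on a made-up SR_Name column (the toy DataFrame is an assumption). Note that `.cat.codes` is only available on a category dtype, so the sketch converts the column first.

```python
import pandas as pd

# Made-up sample of region names.
df = pd.DataFrame({'SR_Name': ['Sagaing', 'Kachin', 'Yangon', 'Kachin']})

# Label encoding: one integer per category (categories are sorted alphabetically).
codes = df['SR_Name'].astype('category').cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df, columns=['SR_Name'])

print(codes.tolist())
print(list(one_hot.columns))
```

The label-encoded column stays a single column of integers, while the one-hot version grows by one column per category.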

In [37]:
has_special = df_mimu['name'].str.contains(r'[^a-zA-Z\s]', na=False)
print(f"Rows with special characters: {has_special.sum()} out of {len(df_mimu)}")

Rows with special characters: 925 out of 19513


Some of the rows contain special characters, so some of the text preprocessing functions are needed here.

In [None]:
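from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# remove_special_characters is assumed to be defined elsewhere in the preprocessing module.

def preprocess_textinput(df, text_column):
    # Bag-of-words counts over the cleaned text, capped at 5000 vocabulary terms.
    vectorizer = CountVectorizer(max_features=5000)
    df['cleaned_text'] = df[text_column].apply(remove_special_characters)
    X = vectorizer.fit_transform(df['cleaned_text']).toarray()
    return X

def preprocess_textinput_tfidf(df, text_column):
    # Same pipeline, but with TF-IDF weights instead of raw counts.
    vectorizer = TfidfVectorizer(max_features=5000)
    df['cleaned_text'] = df[text_column].apply(remove_special_characters)
    X = vectorizer.fit_transform(df['cleaned_text']).toarray()
    return X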

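The vectorizers are one of the important parts of this project. Instead of one-hot encoding each full name, they represent the text by the words it contains, either as raw counts (CountVectorizer) or as TF-IDF weights (TfidfVectorizer). With max_features=5000 the vocabulary is capped, which significantly reduces the number of input features and therefore the number of parameters the model needs.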

# Project Code Breakdown

Cleaning and preprocessing are discussed above. In this step we look at data preparation and model creation.\
The preprocessed data are then split 70/30 into training and test sets, with SR_Name as the classification target.

The initial model uses two hidden layers with 32 and 16 neurons, both with 'relu' activation.

In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

The model is compiled with:
1. adam optimizer: works well for most problems
2. sparse_categorical_crossentropy: takes the label-encoded target column ('SR_Name') directly as integer class indices, so the target does not need to be one-hot encoded, and handles all 18 classes efficiently
3. accuracy: the metric, since we want to see how accurate the model is
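
To show why the sparse loss pairs naturally with label-encoded targets, the following plain-Python sketch computes the cross-entropy for a single sample in both forms. The three-class probabilities are made up, and the two helper functions are illustrative re-implementations, not the Keras code.

```python
import math

def sparse_categorical_crossentropy(label, probs):
    """Loss when the target is an integer class index."""
    return -math.log(probs[label])

def categorical_crossentropy(one_hot, probs):
    """Loss when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.7, 0.2, 0.1]   # made-up softmax output for 3 classes
label = 0                 # label-encoded target
one_hot = [1, 0, 0]       # the same target, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)
```

Both forms give the same value; the sparse variant simply avoids building an 18-column one-hot target.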

The model is evaluated using classification_report, since this is a multi-class classification problem.
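
The split and evaluation steps might look roughly like the sketch below. It runs on made-up data with four classes, uses random probabilities as a stand-in for `model.predict`, and assumes a stratified split with a fixed random_state; none of these details are confirmed by the project code.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 5))          # 100 made-up samples with 5 features
y = np.repeat(np.arange(4), 25)   # 4 made-up classes, 25 samples each

# 70/30 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Stand-in for model.predict(X_test): random class probabilities per row.
probs = rng.random((len(X_test), 4))
y_pred = probs.argmax(axis=1)     # predicted class = most probable class

report = classification_report(y_test, y_pred, zero_division=0)
print(report)
```

The report lists precision, recall, and F1 per class, which is more informative than a single accuracy figure when some regions have far fewer samples than others.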