# Data wrangling

Data wrangling, sometimes referred to as **data munging**, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A **data wrangler** is a person who performs these transformation operations. [Wiki](https://en.wikipedia.org/wiki/Data_wrangling)

Wrangler is an interactive tool for data cleaning and transformation: spend less time formatting and more time analyzing your data. [Stanford](http://vis.stanford.edu/wrangler/)

### Example - 1

#### 0. Requirement

I was given a data problem: write a model that cleans database values automatically, without manual work. This was the first practical ML solution I delivered to a client.

#### 1. Analysis

In [1]:
#!/usr/bin/env python3.5
# encoding: utf-8

import random
import csv
from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier

age_file = 'age.csv'
training_percent = 0.8

Analyse the dataset before processing it. I was given a column of actual values and a column with their corresponding corrected values. I planned to reuse the approach from the name gender prediction in my previous project [Github - Name Gender Prediction](https://github.com/vijayanandrp/ML-001-Name-Text-Gender-Predictor-Classifier)

In [27]:
import pandas as pd
age_df = pd.read_csv(age_file, header=None, usecols=[1,2])
age_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)

In [3]:
age_df.shape

(480, 2)

In [4]:
age_df.describe()

| | actual | correction |
|---|---|---|
| count | 480 | 480 |
| unique | 480 | 14 |
| top | 36years old | 30 to 34 |
| freq | 1 | 51 |


In [5]:
age_df.head()

| | actual | correction |
|---|---|---|
| 0 | 18 to 20 | 18 to 20 |
| 1 | 18 | 18 to 20 |
| 2 | 18 - 20 | 18 to 20 |
| 3 | 18 - 21 | 18 to 20 |
| 4 | 18 - 22 | 18 to 20 |


In [6]:
age_df.tail()

| | actual | correction |
|---|---|---|
| 475 | ?? ?? | ? |
| 476 | ?? ??? ???? | ? |
| 477 | ???? ??? ???? | ? |
| 478 | SHL Bureau 7 | ? |
| 479 | SMT6 | ? |


In [7]:
age_df.sample(10)

| | actual | correction |
|---|---|---|
| 435 | 78 | 65+ |
| 284 | 47 years | 45 to 49 |
| 261 | 45 | 45 to 49 |
| 46 | 22 years | 21 to 24 |
| 140 | 31 | 30 to 34 |
| 109 | 28years old | 25 to 29 |
| 461 | Jonger dan 18 | Under 18 |
| 129 | 30 a 34 | 30 to 34 |
| 187 | 36 - 40 | 35 to 39 |
| 108 | 28Years | 25 to 29 |


In [26]:
age_df['correction'].unique()

array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',
       '40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',
       'Declined to Respond', 'Under 18', '?'], dtype=object)
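
Twelve of the 14 labels are age bands, so an input that is already a clean whole number only needs a lookup to land in the right band. The sketch below shows that lookup as a baseline to keep in mind when judging the classifiers. It assumes the band edges implied by the label names, and `bucket_for_age` is a hypothetical helper that is not part of the original notebook.

```python
from bisect import bisect_right

# Lower edge of each band, read off the label names (an assumption about the
# intended boundaries, not something stated in the dataset).
LOWER_EDGES = [18, 21, 25, 30, 35, 40, 45, 50, 55, 60, 65]
LABELS = ['Under 18', '18 to 20', '21 to 24', '25 to 29', '30 to 34',
          '35 to 39', '40 to 44', '45 to 49', '50 to 54', '55 to 59',
          '60 to 64', '65+']


def bucket_for_age(age):
    """Return the age-band label for a whole-number age (hypothetical helper)."""
    # bisect_right counts how many lower edges are <= age,
    # which is the index of the matching label.
    return LABELS[bisect_right(LOWER_EDGES, age)]


for age in (17, 18, 28, 47, 78):
    print(age, '->', bucket_for_age(age))
```

This agrees with sampled rows such as `78` to `65+` and `47 years` to `45 to 49`. It does nothing for free-text values like `36 - 40` or `Jonger dan 18`, which is where the feature extraction and classifiers come in.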

#### 2. Solution

##### Making feature matrix X

In [8]:
def feature_extraction(_data):
    """Extract a dictionary of features from a raw age string."""
    # Collect the digits in the given string. Example - data='18-20' digits='1820'
    digits = ''.join(c for c in _data if c.isdigit())
    # number of digits found (not the length of the whole string)
    len_digits = len(digits)
    # Split the digits into two-digit values. Example - digits='1820' ages=[18, 20]
    ages = [int(digits[i:i + 2]) for i in range(0, len_digits, 2)]
    # Record which special characters appear in the given data
    special_character = '.+-<>?'
    spl_char = ''.join(c for c in special_character if c in _data)
    # Three digits are treated as a decimal age. Example - data='18.5' ages=[18, 5]
    if len_digits == 3:
        spl_char = '.'
        age = "".join([str(ages[0]), '.', str(ages[1])])
        # normalizing: subtract 0.5 and truncate to a whole year
        age = int(float(age) - 0.5)
        ages = [age]
    # Finding the maximum, minimum, average age values
    max_age = 0
    min_age = 0
    mean_age = 0
    if len(ages):
        max_age = max(ages)
        min_age = min(ages)
    if len(ages) == 2:
        mean_age = int((max_age + min_age) / 2)
    else:
        mean_age = max_age
    # specially added for 18 years cases
    only_18 = 0
    is_y = 0
    if ages == [18]:
        only_18 = 1
        if 'y' in _data or 'Y' in _data:
            is_y = 1
    under_18 = 0
    if 1 < max_age < 18:
        under_18 = 1
    above_65 = 0
    if mean_age >= 65:
        above_65 = 1
    # Flag whether any digit was found in the given string.
    # Example - data='18-20' digits_found=1 data='????' digits_found=-1
    digits_found = 1
    if len_digits == 1:
        # a single digit is not treated as a usable age, so zero the numeric features
        max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0
    elif len_digits == 0:
        # no digits at all: mark every numeric feature as missing with -1
        digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1

    feature = {
        'ages': tuple(ages),
        'len(ages)': len(ages),
        'spl_chr': spl_char,
        'is_digit': digits_found,
        'max_age': max_age,
        'mean_age': mean_age,
        'only_18': only_18,
        'is_y': is_y,
        'above_65': above_65,
        'under_18': under_18
    }

    return feature
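
The digit-pairing step at the top of `feature_extraction` decides most of the other features, so it is worth seeing on its own. The sketch below isolates it in a hypothetical helper, `split_ages`, and runs it on a few raw values of the kind shown in the analysis section.

```python
def split_ages(text):
    """Pair up the digits of `text` two at a time (hypothetical helper)."""
    digits = ''.join(c for c in text if c.isdigit())
    return [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]


for raw in ('18 - 20', '28years old', '18.5', 'SMT6', '?? ??'):
    print(repr(raw), '->', split_ages(raw))
```

Because digits are always paired from the left, a range with a single-digit bound such as `5 to 10` would be read as `[51, 0]`. The feature function handles only the three-digit decimal case explicitly, so inputs like that rely on the classifier having seen similar examples.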

##### Loading dataset

In [9]:
dataset = []
with open(age_file, newline='') as fp:
    input_data = csv.reader(fp, delimiter=',')
    for row in input_data:
        dataset.append((row[1:]))
feature_sets = [(actual, correction) for (actual, correction) in dataset]
random.shuffle(feature_sets)
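
The loading step keeps only the last two columns of each row. The sketch below shows the same slicing on an in-memory file, assuming the first column of `age.csv` is a row index (the notebook skips it with `row[1:]` and `usecols=[1,2]` but never shows it). The two rows reuse values from the `head()` output.

```python
import csv
import io

# Two rows in an assumed index,actual,correction layout.
sample = io.StringIO("0,18 to 20,18 to 20\n1,18,18 to 20\n", newline='')

pairs = []
for row in csv.reader(sample):
    # row[0] is the assumed index column; keep only (actual, correction)
    actual, correction = row[1:]
    pairs.append((actual, correction))

print(pairs)
```

Unpacking into exactly two names fails loudly on a row with a missing or extra column, which is easier to debug than a malformed pair reaching the classifier.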

##### Creating feature matrix X and response vector y

In [10]:
feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]

##### Visualizing Feature Matrix X

In [11]:
feature_val = [val[0]  for val in feature_sets]
feature_df = pd.DataFrame(feature_val)

In [12]:
feature_df.shape

(480, 10)

In [13]:
feature_df.sample(10)

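| | above_65 | ages | is_digit | is_y | len(ages) | max_age | mean_age | only_18 | spl_chr | under_18 |
|---|---|---|---|---|---|---|---|---|---|---|
| 432 | 0 | (63,) | 1 | 0 | 1 | 63 | 63 | 0 | | 0 |
| 154 | 0 | (32,) | 1 | 0 | 1 | 32 | 32 | 0 | | 0 |
| 430 | 0 | (58,) | 1 | 0 | 1 | 58 | 58 | 0 | | 0 |
| 407 | 0 | (34,) | 1 | 0 | 1 | 34 | 34 | 0 | | 0 |
| 65 | 0 | (34,) | 1 | 0 | 1 | 34 | 34 | 0 | | 0 |
| 462 | 1 | (65,) | 1 | 0 | 1 | 65 | 65 | 0 | > | 0 |
| 97 | 1 | (65,) | 1 | 0 | 1 | 65 | 65 | 0 | | 0 |
| 356 | -1 | () | -1 | -1 | 0 | -1 | -1 | -1 | ? | -1 |
| 180 | 0 | (20, 24) | 1 | 0 | 2 | 24 | 22 | 0 | - | 0 |
| 451 | 1 | (66,) | 1 | 0 | 1 | 66 | 66 | 0 | | 0 |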


##### Train Test Split 

In [14]:
cut_point = int(len(feature_sets) * training_percent)
train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]
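
The shuffle earlier in the notebook is unseeded, so each run puts different rows in the test set and the accuracies below shift slightly from run to run. A minimal sketch of a repeatable split follows. `split_dataset` and the seed value are assumptions for illustration and are not part of the original notebook.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of `items` with a fixed seed, then cut it in two."""
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cut_point = int(len(shuffled) * train_fraction)
    return shuffled[:cut_point], shuffled[cut_point:]


# 480 stand-in items, the same size as the dataset
train, test = split_dataset(range(480))
print(len(train), len(test))
```

With 480 rows and `training_percent = 0.8`, the cut falls at 384, leaving 96 rows for testing.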

##### NaiveBayes Classifier

In [15]:
nb_classifier = NaiveBayesClassifier.train(train_set)

In [16]:
print("Accuracy of NaiveBayesClassifier: {} ".format(classify.accuracy(nb_classifier, test_set)))

Accuracy of NaiveBayesClassifier: 0.9583333333333334 
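
The score is the fraction of test items whose predicted label matches the recorded correction. The sketch below spells out that calculation with a toy classifier and three made-up test items. Both are hypothetical stand-ins, not the trained model or the real data.

```python
def accuracy(classify, labelled_items):
    """Fraction of (features, label) pairs where classify(features) == label."""
    hits = sum(1 for features, label in labelled_items
               if classify(features) == label)
    return hits / len(labelled_items)


def toy_classify(features):
    """Hypothetical stand-in for a trained classifier's classify method."""
    return '65+' if features['max_age'] >= 65 else '60 to 64'


toy_test_set = [
    ({'max_age': 66}, '65+'),
    ({'max_age': 63}, '60 to 64'),
    ({'max_age': 64}, '65+'),  # labelled to disagree with the toy rule
]

print(accuracy(toy_classify, toy_test_set))
```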


In [17]:
nb_classifier.show_most_informative_features(10)

Most Informative Features
                 max_age = 65                65+ : 60 to  =     10.4 : 1.0
                 max_age = 59             55 to  : 50 to  =      7.2 : 1.0
               len(ages) = 2              60 to  : Under  =      6.3 : 1.0
                 spl_chr = ''                65+ : ?      =      5.8 : 1.0
                 max_age = 39             35 to  : 30 to  =      5.7 : 1.0
                 only_18 = 0              30 to  : ?      =      4.9 : 1.0
                under_18 = 0              30 to  : ?      =      4.9 : 1.0
                    is_y = 0              30 to  : ?      =      4.9 : 1.0
                above_65 = 0              30 to  : ?      =      4.9 : 1.0
               len(ages) = 1                 65+ : ?      =      4.9 : 1.0


##### Maxent Classifier

In [18]:
max_classifier = MaxentClassifier.train(train_set)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.63906        0.055
             2          -1.71509        0.930
             3          -1.30567        0.964
             4          -1.03310        0.964
             5          -0.84445        0.966
             6          -0.70872        0.992
             7          -0.60766        0.992
             8          -0.53016        0.992
             9          -0.46919        0.992
            10          -0.42015        0.992
            11          -0.37998        0.992
            12          -0.34653        0.992
            13          -0.31829        0.992
            14          -0.29417        1.000
            15          -0.27333        1.000
            16          -0.25517        1.000
            17          -0.23921        1.000
            18          -0.22508        1.000
            19          -0.21249        1.000
 

In [19]:
print("Accuracy of MaxentClassifier: {} ".format(classify.accuracy(max_classifier, test_set)))

Accuracy of MaxentClassifier: 0.9791666666666666 


In [20]:
max_classifier.show_most_informative_features(10)

   6.505 is_y==1 and label is '18 to 20'
  -6.145 spl_chr=='' and label is '?'
   4.921 ages==(7,) and label is '?'
   4.921 mean_age==0 and label is '?'
   4.921 max_age==0 and label is '?'
   4.000 ages==(61, 65) and label is '60 to 64'
   4.000 mean_age==63 and label is '60 to 64'
   3.910 ages==(18, 21) and label is '18 to 20'
   3.883 ages==(50, 59) and label is '50 to 54'
   3.845 ages==(56, 60) and label is '55 to 59'


##### Decision Tree Classifier

In [21]:
decision_classifier = DecisionTreeClassifier.train(train_set)

In [22]:
print("Accuracy of DecisionTreeClassifier: {} ".format(classify.accuracy(decision_classifier, test_set)))

Accuracy of DecisionTreeClassifier: 0.9166666666666666 
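
The three accuracies are easier to compare as row counts. With 480 rows and an 80% training cut, 96 rows are held out, so the sketch below converts each reported score back into the number of correct predictions.

```python
test_size = 480 - int(480 * 0.8)  # 96 held-out rows

# Accuracies copied from the outputs above.
reported = {
    'NaiveBayesClassifier': 0.9583333333333334,
    'MaxentClassifier': 0.9791666666666666,
    'DecisionTreeClassifier': 0.9166666666666666,
}

correct_counts = {name: round(acc * test_size) for name, acc in reported.items()}

for name, correct in correct_counts.items():
    print('{:<24} {} of {} correct'.format(name, correct, test_size))
```

The gap between the best and worst classifier is a handful of rows. Since the shuffle is unseeded, the ranking may change on another run.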


#### 3. Evaluation

In [23]:
print('Enter q (or) quit to end this test module')
while True:
    data = input('\nEnter data for testing: ')
    if data.lower() in ('q', 'quit'):
        print('End')
        break

    if not data:
        continue

    features = feature_extraction(data)
    print(features)
    prediction = [nb_classifier.classify(features),
                  max_classifier.classify(features),
                  decision_classifier.classify(features)]

    print('NaiveBayes Classifier     : ', prediction[0])
    print('Maxent Classifier         : ', prediction[1])
    print('Decision Tree Classifier  : ', prediction[2])
    print('-'*75)
    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))

Enter q (or) quit to end this test module

Enter data for testing: q
End
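
The "Best of 3" line is a majority vote over the three predictions. When all three classifiers disagree, `max(set(prediction), key=prediction.count)` picks whichever label the set happens to yield first, and that can differ between runs. The sketch below is one possible alternative using `collections.Counter`, where a tie goes to the earliest prediction in the list. `majority_vote` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label; on a tie, the earliest prediction wins."""
    return Counter(predictions).most_common(1)[0][0]


# Two classifiers agree, so their label wins.
print(majority_vote(['25 to 29', '25 to 29', '30 to 34']))

# Three-way disagreement: the first prediction in the list is returned.
print(majority_vote(['25 to 29', '30 to 34', '35 to 39']))
```

With this tie-break, the order of the list matters, so placing the classifier you trust most first makes it the fallback.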
