**Explanation of data for future reference:**
1. flag: Whether the customer has bought the target product or not
2. gender: Gender of the customer
3. education: Education background of customer
4. house_val: Value of the residence the customer lives in
5. age: Age of the customer by group
6. online: Whether the customer had online shopping experience or not
7. customer_psy: Variable describing consumer psychology based on the area of residence
8. marriage: Marriage status of the customer
9. children: Whether the customer has children or not
10. occupation: Career information of the customer
11. mortgage: Housing Loan Information of customers
12. house_own: Whether the customer owns a house or not
13. region: Information on the area in which the customer is located
14. car_prob: The probability that the customer will buy a new car (1 means the maximum possible)
15. fam_income: Family income information of the customer (A means the lowest, and L means the highest)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import missingno as mno
from sklearn import preprocessing
from sklearn.model_selection import train_test_split


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
print("Setup Complete")
# Any results you write to the current directory are saved as output.

In [None]:
#Load data
sales_data_filepath = "../input/individual-company-sales-data/sales_data.csv"
sales_data = pd.read_csv(sales_data_filepath, index_col=0, encoding="latin-1")
sales_data.sample(10)

In [None]:
#Visualize null data in array
mno.matrix(sales_data)

In [None]:
#delete marriage column, since it has so many missing data values
sales_data = sales_data.drop('marriage', axis=1)
sales_data.head()

In [None]:
#what are the unique values in education
sales_data['education'].unique()

I notice the following issues with the data:
* The house owner column has missing values, and it is binary (they either own a house, or they don't). So, I'm tempted to drop the column entirely. If the data were numerical, I could fill all null values with the mode or median.
* There are also missing values in the education column, but few enough that we lose less data by removing every row that has a null value than by removing the column entirely.
* The education column uses both numbers and phrases to describe each category, which is redundant. I would like to rename these.
* The age column does the same thing, and its categories aren't tidy either. Does under 45 also mean over 35, since the next-youngest category cutoff is 35? I need to rename these.
* I don't know which characteristics the customer psychology or family income categories map to. I posed this question on the discussion forum, but will remove both columns for now.
* There are some 'U's in the child column, which I take to mean unknown. Since these make up almost one-fourth of the data entries, I'm going to delete the entire column.
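
A minimal sketch of how the shares of missing and unknown values could be checked before deciding what to drop. The `toy` frame and its values are made up for illustration; on the real data the same calls would run on `sales_data`.

```python
import pandas as pd

# Toy stand-in for sales_data; the values are invented for illustration.
toy = pd.DataFrame({
    "education": ["1. HS", None, "3. Bach", "4. Grad"],
    "house_owner": ["Owner", None, None, "Renter"],
    "child": ["U", "Y", "N", "U"],
})

# Share of null values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Share of each category in one column, including 'U' (taken to mean unknown).
child_share = toy["child"].value_counts(normalize=True)
print(child_share)
```

Looking at shares rather than raw counts makes it easier to compare "drop the rows" against "drop the column".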

I'm tempted to just remove anything I don't understand, but first I'd like to look at a correlation heatmap to see whether there is any category I can't afford to lose. Except I can't compute correlations between variables that are categorical rather than continuous and numerical. I would love to come back to this later, but for now I will just eliminate all the data that doesn't play nice.
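
One rough way to look at a categorical column without a correlation heatmap is a normalized cross-tabulation against the target. This is only a sketch on invented data, and it assumes the flag is available as a regular column (in this notebook it starts out as the index).

```python
import pandas as pd

# Toy data; 'flag' is the target and 'child' is a categorical feature.
toy = pd.DataFrame({
    "flag": ["Y", "Y", "N", "N", "Y", "N"],
    "child": ["Y", "Y", "N", "U", "U", "N"],
})

# Each row sums to 1: the share of buyers (Y) and non-buyers (N)
# within every category of 'child'.
rates = pd.crosstab(toy["child"], toy["flag"], normalize="index")
print(rates)
```

If the buyer share barely changes between categories, the column probably carries little information about the flag.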

In [None]:
#column deletions
sales_data = sales_data.drop(['customer_psy','child', 'house_owner', 'fam_income'], axis=1)

And it turns out I have no idea how to do the equivalent of Find & Replace All in a dataframe. I'm going to use iterrows, which is strongly warned against on Stack Overflow:
https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas

With iterrows, I want to be able to use iloc[]. However, right now my index is not the row number but the flag (Y/N). I want the index value of any row to tell me at a glance which row it is in the dataframe. So, I'm going to create a new column for the flag data and fill the index with ascending integers.

In [None]:
#create a new column for the flag
sales_data['flag'] = sales_data.index
#replace the old values in the index with ascending integers
sales_data.index = np.arange(len(sales_data))
#rename index using the rename_axis method
sales_data = sales_data.rename_axis('customerID')
sales_data.head()
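
A possible shortcut for the cell above, sketched on a toy frame: `reset_index()` moves the current index into a regular column and replaces it with 0..n-1 in one step. The sketch assumes the index is named `flag` (which is what `index_col=0` should produce if the first CSV column has that header). Note that the flag column ends up first rather than last.

```python
import pandas as pd

# Toy frame whose index holds the flag, mimicking read_csv(..., index_col=0).
toy = pd.DataFrame(
    {"gender": ["M", "F", "M"]},
    index=pd.Index(["Y", "N", "Y"], name="flag"),
)

# Move 'flag' into a column; the new index is 0, 1, 2, ...
toy = toy.reset_index()
# Give the new integer index a name.
toy = toy.rename_axis("customerID")
print(toy)
```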

In [None]:
#sales_data.iloc[[2],[0]] = "M"
#print(sales_data.iloc[[2],[0]])

In [None]:
#sales_data['flag'] = sales_data.index
#print(sales_data['flag'])

In [None]:
#clean strings while iterating through each row
'''
for index, row in sales_data.iterrows():
    if (row['education'] == '4. Grad'):
        row['education'] = '4'
    elif (row['education'] == '3. Bach'):
        row['education'] = '3'
    elif (row['education'] == '2. Some College'):
        row['education'] = '2'
       #print(row)
'''
#The above code appears to work when I print out the rows, but when I look at sales_data.head() again
# nothing seems to have changed. Credit to Rish Patel for showing me replace.

#On education
sales_data.replace('0. <HS', 'dropout', inplace=True)
sales_data.replace('1. HS', 'hs', inplace=True)
sales_data.replace('2. Some College', 'associates', inplace=True)
sales_data.replace('3. Bach', 'bachelors', inplace=True)
sales_data.replace('4. Grad', 'masters', inplace=True)

#On age
sales_data.replace('1_Unk', '1', inplace=True)
sales_data.replace('2_<=25', '2', inplace=True)
sales_data.replace('3_<=35', '3', inplace=True)
sales_data.replace('4_<=45', '4', inplace=True)
sales_data.replace('5_<=55', '5', inplace=True)
sales_data.replace('6_<=65', '6', inplace=True)
sales_data.replace('7_>65', '7', inplace=True)

#On mortgage
sales_data.replace('1Low', 'low', inplace=True)
sales_data.replace('2Med', 'medium', inplace=True)
sales_data.replace('3High', 'high', inplace=True)
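
The pandas documentation warns that `iterrows` may hand back a copy of each row, so writing to `row` can have no effect on the dataframe, which would explain what I saw. The calls above search every column for each string. A column-scoped variant might look like the sketch below, which uses a toy frame; `education_map` is a name I made up for the example.

```python
import pandas as pd

# Toy frame with the same label formats as the real columns.
toy = pd.DataFrame({
    "education": ["4. Grad", "3. Bach", "1. HS"],
    "age": ["2_<=25", "7_>65", "1_Unk"],
})

# One mapping per column, applied only to that column.
education_map = {
    "0. <HS": "dropout",
    "1. HS": "hs",
    "2. Some College": "associates",
    "3. Bach": "bachelors",
    "4. Grad": "masters",
}
toy["education"] = toy["education"].replace(education_map)

# The age labels all start with "<number>_", so keep only the part before "_".
toy["age"] = toy["age"].str.split("_").str[0]
print(toy)
```

Scoping each mapping to its column means a string that happens to appear in another column is left alone.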

In [None]:
#There are two ways to address the null values in the education column. You can either fill them or drop the entire row
# Uncomment the one you prefer (I'm opting to drop rows)

#1. Replace null values
#sales_data['education'] = sales_data['education'].fillna('none')
#2. Drop all rows that have a null value
sales_data = sales_data.dropna()

#Check whether any column still contains null values
print('Columns with null values:\n', sales_data.isnull().sum())
print("-"*10)

Now, I need to work on **Feature Engineering**, which is described well here:
(credit to https://www.kaggle.com/mjbahmani/a-comprehensive-ml-workflow-with-python)

**6-4-2 Feature Encoding**
In machine learning projects, one important part is feature engineering. It is very common to see categorical features in a dataset. However, our machine learning algorithm can only read numerical values. It is essential to encode categorical features into numerical values[28]

Encode labels with value between 0 and n_classes-1
LabelEncoder can be used to normalize labels.
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
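
A small sketch of what LabelEncoder does to one column, using the cleaned mortgage labels as toy input. The integer codes follow the sorted order of the labels, so they do not reflect the natural low < medium < high ordering.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the distinct labels and returns their integer codes.
codes = le.fit_transform(["low", "high", "medium", "low"])

print(list(le.classes_))  # labels in sorted order
print(list(codes))        # position of each input label in classes_

# inverse_transform maps the codes back to the original labels.
print(list(le.inverse_transform(codes)))
```

For a tree-based model this arbitrary ordering is usually tolerable, but it is worth keeping in mind for models that treat the codes as magnitudes.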

In [None]:
#Split data into testing and training sets
#This divides the data by absolute number of entries. However we want to divide by percent.
'''
train_data = sales_data[:200]
test_data = sales_data[200:]
'''
#train_test_split function is imported from sklearn. 
#test_size set so that 80% of data is used for training
train_data, test_data = train_test_split(sales_data,test_size=0.2)
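
The split above is random on every run. A sketch of a reproducible variant on toy data follows: `random_state` fixes the shuffle, and `stratify` (optional) tries to keep the Y/N balance similar in both parts. The seed value 0 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an uneven Y/N balance.
toy = pd.DataFrame({"flag": ["Y"] * 6 + ["N"] * 4, "x": range(10)})

train, test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["flag"]
)

# Running the same call again with the same seed gives the same rows.
train_again, test_again = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["flag"]
)
print(len(train), len(test))
```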

In [None]:
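def encode_features(df_train, df_test):
    features = ['gender', 'education','house_val','age','online','occupation','mortgage','region','car_prob']
    # Work on copies so the slices returned by train_test_split are not modified in place
    # (this avoids pandas' SettingWithCopyWarning)
    df_train = df_train.copy()
    df_test = df_test.copy()
    # Fit each encoder on train and test together so both share the same label-to-integer mapping
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test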

train_data, test_data = encode_features(train_data, test_data)

In [None]:
#I want to see if all the data was successfully converted to numerical:
test_data.info()

In [None]:
#Separate data into features(x) and target(y)
x_all = train_data.drop(['flag'], axis=1)
y_all = train_data['flag']

In [None]:
#We're using train_test_split for the second time, but this time with additional parameters
num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(x_all, y_all, test_size=num_test, random_state=100)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
# Choose the type of classifier. 
rfc = RandomForestClassifier()

# Choose some parameter combinations to try
parameters = {'n_estimators': [4, 6, 9], 
              'max_features': ['log2', 'sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1,5,8]
             }

# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(accuracy_score)

# Run the grid search
grid_obj = GridSearchCV(rfc, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set rfc to the best combination of parameters
rfc = grid_obj.best_estimator_

# Refit the best estimator on the training data.
# (GridSearchCV already refits it by default, so this step is optional.)
rfc.fit(X_train, y_train)
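
A sketch of how the winning combination could be inspected after a grid search. It runs on synthetic data from `make_classification` with a much smaller grid than the one above, so the numbers it prints say nothing about the sales data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [4, 6], "max_depth": [2, 3]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

# The parameters that scored best, and their mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ is already fitted.
demo_predictions = search.best_estimator_.predict(X_demo)
```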

In [None]:
rfc_prediction = rfc.predict(X_test)
rfc_score=accuracy_score(y_test, rfc_prediction)
print(rfc_score)

It worked! The output was:
**0.689164809508649**
It took about two minutes to compute.
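
To put an accuracy score in context, one option is to compare it with a baseline that always predicts the most common class. The sketch below uses invented labels with a 70/30 split; I have not checked the real class balance here, so this only shows the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented labels: 7 buyers and 3 non-buyers.
y_toy = np.array(["Y"] * 7 + ["N"] * 3)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

On the real data, the same comparison would use `y_train` to fit the baseline and `y_test` to score it, next to `rfc_score`.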