# Theoretical Questions

<b> Q: Why is "data mining" a misnomer? What is another preferred name? </b><br />
<b>A</b>: The term "data mining" is a misnomer because the goal is to extract patterns and knowledge from large amounts of data, not to extract (mine) the data itself. A preferred name is Knowledge Discovery in Databases (KDD).

source: https://en.wikipedia.org/wiki/Data_mining#cite_note-han-kamber-6

<b>Q: What is the general knowledge discovery process? </b><br />
<b>A</b>: First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

source: https://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf

<b>Q: In Data Mining, what is the difference between prediction and categorization?</b><br />
<b>A: </b> Classification is the process of identifying the category or class label to which a new observation belongs. Prediction is the process of estimating a missing or unavailable numerical value for a new observation. That is the key difference between classification and prediction.

source: https://www.differencebetween.com/difference-between-classification-and-vs-prediction/
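
The distinction can be illustrated with a minimal sketch. The toy data and the variable names below are assumptions made for illustration: a classifier returns a class label, while a regressor returns a number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): one feature, six observations.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification target: a discrete class label per observation.
y_class = np.array([0, 0, 0, 1, 1, 1])
# Numeric prediction (regression) target: a continuous value per observation.
y_value = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])

classifier = LogisticRegression().fit(X, y_class)
regressor = LinearRegression().fit(X, y_value)

new_observation = np.array([[5.5]])
predicted_label = classifier.predict(new_observation)  # a class label
predicted_value = regressor.predict(new_observation)   # a real number
print(predicted_label, predicted_value)
```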

<b>Q: In a linear model, which regularization method encourages sparsity? </b><br />
<b>A:</b> Lasso regression (L1 regularization) encourages sparsity. Lasso (Least Absolute Shrinkage and Selection Operator) adds the absolute value of each coefficient's magnitude as a penalty term to the loss function. It can shrink the coefficients of less important features to exactly zero, removing those features altogether, so it works well for feature selection when there are many features.

source: https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c
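
A small sketch of this effect, on synthetic data where only two of ten features matter. The data, the `alpha` values and the variable names are assumptions chosen for illustration. Lasso is compared with Ridge (L2), which shrinks coefficients but typically does not set them exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features influence the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that were driven exactly to zero by each penalty.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("coefficients set exactly to zero by Lasso:", lasso_zeros)
print("coefficients set exactly to zero by Ridge:", ridge_zeros)
```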

<b>Q: Why is there a need for Gradient Descent for optimization? </b><br />
<b>A:</b> Gradient descent is an iterative optimization algorithm used in machine learning to minimize a loss function. The loss function describes how well the model will perform given the current set of parameters (weights and biases), and gradient descent is used to find the best set of parameters. We use gradient descent to update the parameters of our model. For example, parameters refer to coefficients in Linear Regression and weights in neural networks.

source: https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html
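
As a minimal sketch of the update rule, the loop below fits a one-variable linear regression by gradient descent on the mean squared error. The synthetic data, the learning rate and the number of steps are assumed values, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.05, size=100)  # true slope 2.0, intercept 0.5

w, b = 0.0, 0.0      # parameters to learn
learning_rate = 0.1  # step size (an assumed value)

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")
```

The learned `w` and `b` should end up close to the slope and intercept used to generate the data.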

<b> Q: Explain Bias and Variance in terms of underfitting and overfitting? </b> <br />
<b> A: </b> An ideal machine learning model has low bias and low variance. Low bias means the model is flexible enough to fit the training data. Low variance means it gives consistent results across different datasets.

High Bias, Low Variance => underfitting <br />
Low Bias, High Variance => overfitting
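
One way to see both failure modes is to fit polynomials of increasing degree to noisy data and compare training and test error. This is only a sketch: the sine-shaped data and the degrees 1, 4 and 15 are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of one period of a sine wave (hypothetical data)."""
    x = rng.uniform(0.0, 1.0, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

errors = {}  # degree -> (training MSE, test MSE)
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    errors[degree] = (
        mean_squared_error(y_train, model.predict(x_train)),
        mean_squared_error(y_test, model.predict(x_test)),
    )
    print(degree, errors[degree])
```

The straight line (degree 1) has high error on both sets, which is underfitting. The degree-15 model has a much lower training error than test error, which is overfitting.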

<b> Q: Why is data science/machine learning a bad idea in the context of information security?  </b> <br />
<b> A: </b> There are four main risks: <br />
a) Adversarial samples: An adversarial example is an instance with small, intentional feature perturbations that cause a machine learning model to make a false prediction. <br />
b) Backdoor attacks: An attacker plants a hidden trigger in the model, for example through poisoned training data, so that inputs carrying the trigger are misclassified. This breaks system integrity. <br />
c) Information leaks: Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know, and in turn invalidate the estimated performance of the model being constructed. <br />
d) Ethical issues: Data-driven technologies such as AI can replicate the preconceptions and biases of their designers. <br />

# Wish.com: Product Rating Prediction

In [86]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from pprint import pprint

In [87]:
# sample => returns random samples of data from the data set.
# parameters for sample => 
# a) n => number of samples to be returned
# b) frac => fraction, meaning the fraction of dataset that is returned.
# e.g. frac=0.4 returns 40% of the rows; frac=1 returns all rows in random order
data = pd.read_csv('train_new.csv').sample(frac=1) #shuffle
data = data.loc[data['rating'].isin([1, 2, 3, 4, 5])]
data = data.fillna(0)

# drop => drop either a particular row or a column
# parameters for drop =>
# a) axis = 0; drops a row
# b) axis =1; drops a column
# drop identifier and free-text columns that are not used as features
data = data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)

# Cleaning the Data

### Deleting columns that have only one type of data

In [88]:
# Remove all features that have only one value; dropping them will not
# make much difference to the result.
# To check the values in the dataset you can do something like:

# for column_name in data.columns:
#   print(column_name, data[column_name].unique(), '\n\n')
del data['theme'], data['crawl_month'], data['currency_buyer']
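
Instead of reading the unique values by eye, the single-valued columns can be found programmatically. The sketch below uses a small hypothetical frame in place of the real data; `constant_cols` is an illustrative name.

```python
import pandas as pd

# Hypothetical frame standing in for the real data: 'theme' and
# 'currency_buyer' hold a single value, 'price' does not.
df = pd.DataFrame({
    "theme": ["summer", "summer", "summer"],
    "currency_buyer": ["EUR", "EUR", "EUR"],
    "price": [6.0, 11.0, 5.67],
})

# nunique(dropna=False) counts NaN as a value, so a column mixing NaN with
# one other value is not treated as constant.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
print(constant_cols)
```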

### Generating profile reports from a pandas DataFrame

In [None]:
# Using Profile Report to understand the patterns in the data

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport


df = data[[
    'price', 'retail_price', 'units_sold', 'uses_ad_boosts',
       'rating_count', 'badges_count', 'badge_local_product',
       'badge_product_quality', 'badge_fast_shipping',
       'product_variation_size_id', 'product_variation_inventory',
       'shipping_option_name', 'shipping_option_price', 'shipping_is_express',
       'countries_shipped_to', 'inventory_total', 'has_urgency_banner',
       'merchant_rating_count', 'merchant_rating',
       'merchant_has_profile_picture'
]]
profile = ProfileReport(df, title="Pandas Profiling Report")

profile.to_file("report.html")

### Taking log for highly skewed columns

In [89]:
# The histograms show that fields like `rating_count`, `units_sold`, `retail_price` are
# highly skewed, so we take their log to make the distributions more symmetric.

def getLog(x):
    return np.log(x + 1)


data['rating_count']  = data['rating_count'].apply(getLog)
data['units_sold']  = data['units_sold'].apply(getLog)
data['retail_price']  = data['retail_price'].apply(getLog)
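
The effect of the transform can be checked numerically with the sample skewness. This is a sketch on synthetic, right-skewed data, not on the notebook's columns. It uses `np.log1p`, which computes the same `log(x + 1)` as `getLog`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed counts, standing in for a column such as units_sold.
units = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=1000)).round()

skew_before = units.skew()
skew_after = np.log1p(units).skew()  # log1p(x) computes log(1 + x)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```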

### Identify and remove columns with high correlation

In [90]:
from scipy.stats import spearmanr as r
r(data['shipping_option_price'],  data['price'])

SpearmanrResult(correlation=0.8718981759549496, pvalue=0.0)

In [91]:
del data['shipping_option_price']
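
The check above compares one pair of columns. A possible way to scan all numeric pairs at once is a Spearman correlation matrix, sketched below on a hypothetical frame. The `threshold` of 0.8 and the synthetic columns are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(1.0, 50.0, size=300)
# Hypothetical frame: shipping cost grows with price, inventory is unrelated.
df = pd.DataFrame({
    "price": price,
    "shipping_option_price": 0.2 * price + rng.normal(scale=0.5, size=300),
    "inventory_total": rng.integers(1, 51, size=300),
})

corr = df.corr(method="spearman").abs()
# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.8  # assumed cut-off
high_pairs = [
    (row, col, round(float(upper.loc[row, col]), 3))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

For each reported pair, one of the two columns is a candidate for removal.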

# Split the training dataset into training and validation datasets

In [92]:
msk = np.random.rand(len(data)) < 0.7
tr = data[msk]
val = data[~msk]
tr, val

(      price  retail_price  units_sold  uses_ad_boosts  rating  rating_count  \
 275    6.00      2.484907    4.615121               0     3.0      1.609438   
 892    6.00      1.945910    4.615121               0     4.0      2.995732   
 592    5.67      2.995732   11.512935               0     4.0      9.819780   
 927   11.00      2.397895    6.908755               1     5.0      6.025866   
 973    5.85      1.791759    9.210440               0     4.0      6.912743   
 ...     ...           ...         ...             ...     ...           ...   
 405   12.00      2.484907    6.908755               0     4.0      6.735780   
 247   11.00      2.397895    1.098612               0     2.0      0.693147   
 152   12.00      3.465736    6.908755               0     4.0      5.843544   
 316    7.00      1.945910    6.908755               1     4.0      4.682131   
 1093   7.00      1.945910    6.908755               1     3.0      4.955827   
 
       badges_count  badge_local_produ
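
The random mask above gives roughly, not exactly, 70% of the rows to the training part, and a different split on every run. As an alternative sketch (not what this notebook does), scikit-learn's `train_test_split` gives a repeatable split that can also preserve the rating proportions. The frame below is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with an imbalanced 'rating' column.
df = pd.DataFrame({
    "price": range(100),
    "rating": [4] * 70 + [5] * 20 + [3] * 10,
})

# stratify keeps the rating proportions the same in both parts;
# random_state makes the split repeatable.
train_part, val_part = train_test_split(
    df, test_size=0.3, stratify=df["rating"], random_state=0
)
print(len(train_part), len(val_part))
print(val_part["rating"].value_counts().to_dict())
```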

# Creating Categorical Data

In [93]:
dict_cat = {}

# columns that are of categorical value
cat_cols = tr.columns[tr.dtypes==object].to_list()

def cat_digit(col):  
    # build the mapping
    # category type:
    # To convert the object type (usually strings) into integers we can use the category type.
    # Another benefit is memory use: categories occupy less space than strings;
    # check with df.info(memory_usage='deep')
    encoded = col.astype('category').cat.codes
    
    # store the mapping
    # converting two lists into dictionary using the zip
    # zip:
    # keys = ['a', 'b', 'c']
    # values = [1, 2, 3]
    # dictionary = dict(zip(keys, values))
    # print(dictionary) # {'a': 1, 'b': 2, 'c': 3}
    dict_cat[col.name] = dict(zip(np.asarray(col), np.asarray(encoded)))
    return encoded

# for each categorical feature, apply cat_digit where we build the mapping and transform the data
# this is for the training set (where we build the mapping)
tr[cat_cols] = tr[cat_cols].apply(lambda col: cat_digit(col))
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,...,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture
711,14.00,2.564949,6.908755,0,5.0,6.063785,1,0,1,0,...,50,1.0,Quantité limitée !,CN,Superman Fashion Store,supermanfashionstore,"(1,178 notes)",1178,4.223260,0
275,6.00,2.484907,4.615121,0,3.0,1.609438,0,0,0,0,...,50,0.0,0,CN,myunxiaodian,myunxiaodian,"85 % avis positifs (5,264 notes)",5264,4.032865,0
892,6.00,1.945910,4.615121,0,4.0,2.995732,0,0,0,0,...,50,0.0,0,CN,Sangboo Store,sangboostore,"82 % avis positifs (10,600 notes)",10600,3.867547,0
592,5.67,2.995732,11.512935,0,4.0,9.819780,0,0,0,0,...,50,1.0,Quantité limitée !,CN,fashionstore0408,fashionstore0408,"84 % avis positifs (19,248 notes)",19248,3.889131,0
927,11.00,2.397895,6.908755,1,5.0,6.025866,1,1,0,0,...,50,0.0,0,CN,huanfeng,huanfeng,"93 % avis positifs (17,922 notes)",17922,4.380203,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
405,12.00,2.484907,6.908755,0,4.0,6.735780,0,0,0,0,...,50,0.0,0,CN,myunxiaodian,myunxiaodian,"85 % avis positifs (5,264 notes)",5264,4.032865,0
247,11.00,2.397895,1.098612,0,2.0,0.693147,0,0,0,0,...,50,0.0,0,CN,"Guangzhou Fashion Girl co.,ltd",guangzhoufashiongirlcoltd,"86 % avis positifs (21,307 notes)",21307,3.999061,0
152,12.00,3.465736,6.908755,0,4.0,5.843544,0,0,0,0,...,50,1.0,Quantité limitée !,CN,wuguipanwei2016,wuguipanwei2016,85% Feedback pozitiv (639 rating),639,4.007825,1
316,7.00,1.945910,6.908755,1,4.0,4.682131,0,0,0,0,...,50,1.0,Quantité limitée !,CN,alabaostore,alabaostore,83 % avis positifs (247 notes),247,3.943320,0
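
The "set on a copy of a slice" warning appears because `tr` was created by boolean indexing and then assigned to. A common way to avoid it, sketched here on a tiny hypothetical frame, is to take an explicit `.copy()` of the subset before modifying it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "origin_country": ["CN", "US", "CN", "GB"],
    "price": [6.0, 11.0, 5.67, 7.0],
})
mask = np.array([True, False, True, True])

# .copy() makes the subset an independent DataFrame, so assigning to it
# is unambiguous and does not touch df.
train_part = df[mask].copy()
train_part["origin_country"] = train_part["origin_country"].astype("category").cat.codes

print(train_part)
```

The original frame keeps its string values, which matches the output above where `data` still shows the unencoded columns.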


In [94]:
temp = cat_digit(data.shipping_option_name)
print(temp.describe())
print(temp.unique())
data.shipping_option_name.unique()

count    1093.000000
mean        4.060384
std         0.771525
min         0.000000
25%         4.000000
50%         4.000000
75%         4.000000
max        12.000000
dtype: float64
[ 4  3  5  0  6  7 12  8  1 10 11  2  9]


array(['Livraison standard', 'Livraison Express', 'Spedizione standard',
       'Envio Padrão', 'Standard Shipping', 'Standardowa wysyłka',
       'ការដឹកជញ្ជូនតាមស្តង់ដារ', 'Standardversand', 'Envío normal',
       'Стандартная доставка', 'الشحن القياسي', 'Expediere Standard',
       'Standart Gönderi'], dtype=object)

In [95]:
print('categorical features')
pprint(list(dict_cat.keys()))

categorical features
['product_color',
 'product_variation_size_id',
 'shipping_option_name',
 'urgency_text',
 'origin_country',
 'merchant_title',
 'merchant_name',
 'merchant_info_subtitle']


In [96]:
print('Lets see what the mapping for column origin_country :')
pprint(dict_cat['origin_country'])
print('It is a string to integer mapping')

Lets see what the mapping for column origin_country :
{0: 0, 'CN': 1, 'GB': 2, 'SG': 3, 'US': 4, 'VE': 5}
It is a string to integer mapping


In [97]:
# then we will use the mappings built from the training set, to transform the validation set
val[cat_cols] = val[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))
# string values not seen in the training set are replaced with -1
val = val.fillna(-1)
val

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,price,retail_price,units_sold,uses_ad_boosts,rating,rating_count,badges_count,badge_local_product,badge_product_quality,badge_fast_shipping,...,inventory_total,has_urgency_banner,urgency_text,origin_country,merchant_title,merchant_name,merchant_info_subtitle,merchant_rating_count,merchant_rating,merchant_has_profile_picture
711,14.00,2.564949,6.908755,0,5.0,6.063785,1,0,1,0,...,50,1.0,1,1,-1.0,-1.0,-1.0,1178,4.223260,0
885,7.00,2.302585,9.210440,1,5.0,7.516977,1,0,1,0,...,50,1.0,1,1,-1.0,-1.0,-1.0,8727,4.161797,0
780,2.89,2.079442,3.931826,0,5.0,0.693147,0,0,0,0,...,50,0.0,0,1,-1.0,-1.0,-1.0,30,3.733333,0
835,8.00,4.343805,4.615121,1,4.0,2.484907,0,0,0,0,...,50,1.0,1,1,-1.0,-1.0,-1.0,70,3.514286,0
699,8.00,2.302585,9.210440,1,3.0,6.654153,0,0,0,0,...,50,0.0,0,1,-1.0,-1.0,-1.0,54504,3.801684,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1084,2.66,1.386294,4.615121,1,3.0,2.564949,0,0,0,0,...,50,1.0,1,1,374.0,208.0,-1.0,66644,4.137582,1
416,6.00,1.945910,8.517393,1,4.0,7.114769,0,0,0,0,...,50,0.0,0,1,10.0,10.0,364.0,14676,4.074203,0
738,3.83,1.609438,8.517393,1,4.0,5.823046,0,0,0,0,...,50,0.0,0,1,-1.0,-1.0,-1.0,99283,4.285598,0
315,5.00,1.791759,8.517393,0,4.0,6.507278,0,0,0,0,...,50,1.0,1,1,338.0,574.0,255.0,32168,3.884544,0
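
The same idea (build the mapping on the training set, apply it to the validation set, mark unseen values with -1) is also available in scikit-learn. The sketch below is an alternative to `cat_digit`, not the notebook's method. It assumes scikit-learn 0.24 or later for the `handle_unknown` option, and the two small frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_part = pd.DataFrame({"origin_country": ["CN", "US", "CN", "GB"]})
val_part = pd.DataFrame({"origin_country": ["CN", "VE", "GB"]})  # 'VE' is unseen

# Categories are learned from the training data only; values that were
# not seen during fit are encoded as -1 at transform time.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_part[["origin_country"]])
val_codes = encoder.transform(val_part[["origin_country"]])

print(encoder.categories_[0])
print(val_codes.ravel())
```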


# Model Building: Training Data

In [98]:
# y is the response variable
# x holds the explanatory variables
tr_y = tr['rating']
tr_x = tr.drop('rating', axis=1)

print("this is x")
print(tr_x.head())
print("this is y")
print(tr_y.head())

this is x
     price  retail_price  units_sold  uses_ad_boosts  rating_count  \
275   6.00      2.484907    4.615121               0      1.609438   
892   6.00      1.945910    4.615121               0      2.995732   
592   5.67      2.995732   11.512935               0      9.819780   
927  11.00      2.397895    6.908755               1      6.025866   
973   5.85      1.791759    9.210440               0      6.912743   

     badges_count  badge_local_product  badge_product_quality  \
275             0                    0                      0   
892             0                    0                      0   
592             0                    0                      0   
927             1                    1                      0   
973             0                    0                      0   

     badge_fast_shipping  product_color  ...  inventory_total  \
275                    0             66  ...               50   
892                    0             32  ...    

In [99]:
# building the model object
clf = LogisticRegression().fit(tr_x, tr_y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
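
The warning itself suggests two remedies: scale the features or raise `max_iter`. A minimal sketch combining both in a pipeline is shown below. The synthetic features and the value `max_iter=1000` are assumptions, and whether this removes the warning on the real data has not been verified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X = np.column_stack([
    rng.uniform(0, 100_000, size=300),  # e.g. a merchant rating count
    rng.uniform(0, 5, size=300),        # e.g. a merchant rating
])
y = (X[:, 1] > 2.5).astype(int)

# The scaler is fitted on the training data only and reused at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```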


In [100]:
# val is the held-out validation set, used here like a testing set
val_y = val['rating']
val_x = val.drop('rating', axis=1)


In [101]:
# predict on the validation set using the trained model clf

pred_val = clf.predict(val_x)

pred_val = pred_val.astype(int).tolist()

pred_val[:10]

[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

In [102]:
# Micro-averaged F1 is one model metric.
val_score = f1_score(val_y, pred_val, average='micro')
print(val_score)

# Other metrics are described here:
# https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide

0.6858974358974359
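
For single-label multiclass problems, micro-averaged F1 equals accuracy, so a model that always predicts the most common rating can still score well. The first ten predictions shown above are all 4, so it may be worth checking a macro average and a confusion matrix as well. The labels below are hypothetical and only illustrate the effect.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels: most products are rated 4 and the model always predicts 4.
y_true = [4, 4, 4, 4, 4, 4, 4, 5, 3, 5]
y_pred = [4] * 10

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")
# zero_division=0 scores classes that are never predicted as 0 instead of warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f} micro-F1={micro_f1:.3f} macro-F1={macro_f1:.3f}")
# Rows are true ratings, columns are predicted ratings, in the order 3, 4, 5.
print(confusion_matrix(y_true, y_pred, labels=[3, 4, 5]))
```

Macro F1 averages the per-class scores with equal weight, so the classes that are never predicted pull it well below the micro average.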


# Model Building: Testing Data

In [46]:
# once you are happy with your local model, prepare a submission
# apply the same preprocessing steps to the testing set that were applied before training the model

test_data = pd.read_csv('test_new.csv').sample(frac=1) 
_id = test_data['id']
test_data = test_data.fillna(0)
test_data = test_data.drop(['merchant_id', 'merchant_profile_picture', 'id', 'tags'], axis=1)
test_data[cat_cols] = test_data[cat_cols].apply(lambda col: col.map(dict_cat[col.name]))

# again, unseen string values are filled with -1
test_data = test_data.fillna(-1)
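
The cell above repeats the `fillna`, `drop` and category mapping steps, but not the removal of the single-valued and correlated columns or the log transforms applied to the training data. One way to keep the two paths in sync is a shared helper. The function below is a hypothetical sketch: `preprocess` and the column-list names are illustrative, and it assumes the test file has the same columns as the training file.

```python
import numpy as np
import pandas as pd

ID_COLS = ["merchant_id", "merchant_profile_picture", "id", "tags"]
CONSTANT_COLS = ["theme", "crawl_month", "currency_buyer"]
CORRELATED_COLS = ["shipping_option_price"]
LOG_COLS = ["rating_count", "units_sold", "retail_price"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to a copy of df (hypothetical helper)."""
    out = df.fillna(0)
    # errors="ignore" skips columns that a given file does not contain.
    out = out.drop(columns=ID_COLS + CONSTANT_COLS + CORRELATED_COLS, errors="ignore")
    for col in LOG_COLS:
        if col in out.columns:
            out[col] = np.log1p(out[col])
    return out


# Tiny stand-in for a test file.
sample = pd.DataFrame({
    "id": [1, 2],
    "theme": ["summer", "summer"],
    "price": [6.0, 11.0],
    "units_sold": [100, None],
    "shipping_option_price": [2, 3],
})
result = preprocess(sample)
print(result)
```

Calling the same function on the training and test frames before encoding keeps the feature columns identical for `fit` and `predict`.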

In [47]:
pred_test = clf.predict(test_data)
pred_df = pd.DataFrame(data={'id': np.asarray(_id), 'rating': pred_test})
pred_df.to_csv('pred_walkthrough.csv', index=False)
pred_df.head()

Unnamed: 0,id,rating
0,1380,4.0
1,1342,4.0
2,1413,4.0
3,367,4.0
4,369,4.0


In [48]:
pred_df.head(20)

Unnamed: 0,id,rating
0,1380,4.0
1,1342,4.0
2,1413,4.0
3,367,4.0
4,369,4.0
5,557,4.0
6,1443,4.0
7,931,4.0
8,1401,4.0
9,401,4.0
