# HackerRank Checklist
- open https://regex101.com/
- open sample notebooks on local machine (iwaspoisoned3.ipynb - requests, TimeSeries_InterpExtrap, Practice_interp, bank_marketing)
- open GitHub repo
- open local notebook for testing -- copy standard template imports from this notebook
- Set up a timer so you can track your progress against the time limits
- Be ready to copy standard imports from this template into HackerRank

# Standard Template for HackerRank
- Note: HackerRank doesn't seem to allow:
       - import of IPython to display dataframes
       - %matplotlib inline
       - import matplotlib.pyplot as plt
       - import matplotlib.dates as mdates
       - from pylab import rcParams
       - rcParams['figure.figsize'] = 10, 6
       - plt.rc("font", size=14)

## Imports for Local Test

In [9]:
%matplotlib inline
from IPython.display import display, HTML  # IPython.core.display is the older, deprecated import path
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
plt.rc("font", size=14)

## Imports for HackerRank

In [13]:
import os, sys, re
import calendar
from datetime import datetime
from dateutil.relativedelta import *
from scipy.stats import linregress

import collections
from collections import defaultdict, OrderedDict
import itertools
from dateutil import parser

import pandas as pd
pd.set_option('display.max_columns', 100)

import numpy as np
import scipy
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt

import sympy
import requests
from bs4 import BeautifulSoup
from scipy.stats import mode
from numpy import interp  # scipy.interp was deprecated in favor of numpy.interp

from sklearn import preprocessing, linear_model, metrics
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score, f1_score, classification_report, roc_curve, auc
from sklearn.pipeline import Pipeline, FeatureUnion

# if __name__ == "__main__":

# HackerRank I/O
## Notes on HackerRank STDIn and STDOut
- https://www.hackerrank.com/challenges/solve-me-first/problem

## Reading a local CSV file
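Useful `pd.read_csv` arguments:
- lineterminator='\r'
- sep=',' or sep='\t'
- parse_dates=[0]  # date column
- infer_datetime_format=True (deprecated since pandas 2.0)
- date_parser=pd.to_datetime (deprecated since pandas 2.0 in favor of date_format)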

### Set time-based index
- df_pass.set_index("month", inplace = True)

In [8]:
# example: read two integers from STDIN, one per line (sample input shown below)
num1 = int(input())
num2 = int(input())

1
2


# Quickly load sample data to local machine (via copy/paste clipboard)

In [None]:
df = pd.read_clipboard()

### Finding and loading local CSVs
- Date Time codes: https://docs.python.org/2/library/datetime.html

        Code    Meaning                          Example
        %A      Weekday as full name             Wednesday
        %a      Weekday as abbreviated name      Wed
        %B      Month as locale’s full name      June
        %b      Month as abbreviated name        Jun
        %d      Day of the month, zero-padded    06
        %m      Month as a zero-padded number    06
        %Y      Four-digit year                  2018
        %y      Two-digit year                   18
        %H      Hour on a 24-hour clock          14
        
- You can use parse_dates when the date information is spread across multiple columns:
    parse_dates=[['Date', 'Time']]
  or combine the columns after loading:
    pd.to_datetime(df['Date'] + ' ' + df['Time'])
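
The format codes and the column-combining approach above can be tried on a small frame. This is only a sketch: the sample values and the `timestamp` column name are made up for illustration, and the `format` string assumes dates written like `06-Jun-2018`.

```python
import pandas as pd

# Hypothetical sample data: date and time stored as strings in separate columns.
df = pd.DataFrame({
    "Date": ["06-Jun-2018", "07-Jun-2018"],
    "Time": ["14:30", "09:05"],
})

# Join the two columns, then parse with an explicit format:
# %d = zero-padded day, %b = abbreviated month, %Y = four-digit year,
# %H = hour on a 24-hour clock, %M = minute.
df["timestamp"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                 format="%d-%b-%Y %H:%M")

# A time-based index allows slicing and resampling by date.
df = df.set_index("timestamp")
print(df.index)
```

Passing an explicit `format` is usually faster and less ambiguous than letting pandas guess, which matters when day and month could be swapped.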

## Reading STDIN when .tsv-like data is expected
- line 0 = number of samples in the data set
- line 1 = header

In [None]:
t=int(sys.stdin.readline())
my_header = sys.stdin.readline().split()

data = sys.stdin.read().splitlines()
data = [re.split(r'\t', l) for l in data]
df = pd.DataFrame(data, columns= my_header)
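
To test this reading logic locally without piping a file into STDIN, the same steps can be wrapped in a function that accepts any text stream. This is a sketch, not the only way to do it: `read_tsv` is a hypothetical helper name and the sample input is invented. On HackerRank you would pass `sys.stdin` instead of the `io.StringIO` object.

```python
import io
import pandas as pd

def read_tsv(stream):
    """Read line 0 as the sample count, line 1 as the header, the rest as rows."""
    n_samples = int(stream.readline())
    header = stream.readline().split()
    rows = [line.split("\t") for line in stream.read().splitlines() if line]
    return n_samples, pd.DataFrame(rows, columns=header)

# Hypothetical input standing in for sys.stdin.
sample = io.StringIO("2\ncharge_time\tbattery_time\n2.81\t5.62\n7.14\t8.00\n")
n_samples, df = read_tsv(sample)

# Everything read from a text stream is a string, so convert before doing arithmetic.
df = df.astype(float)
print(n_samples, df.shape)
```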

In [None]:
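# In case file data needs to be located
# Note: __file__ is only defined when running as a script, not in a notebook cell
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
cwd = os.getcwd()
print(cwd)
files = os.listdir(os.curdir)
print(files)

# In case data resides in its own directory below
# (path is a placeholder: set it to the data directory first)
os.chdir(path)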

# Verify the data-file has been located
fname = ""
with open(fname, 'r') as fin:
    print(fin.read())
    
# Load the data into a dataframe
data = pd.read_csv(r'the_path_in_the_remote_machine/fname', sep='\t',
                   index_col = 0, header = 1, names = ['charge_time','battery_time'],
                   parse_dates = [0], infer_datetime_format = True, date_parser = pd.to_datetime,
                   )

## Missing Values

In [None]:
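## Use fillna to impute the missing values with the most frequent value (the mode)
## pandas' mode() skips NaN, and assigning the result back avoids chained-assignment warnings
df_bank['job'] = df_bank['job'].fillna(df_bank['job'].value_counts().idxmax())
df_bank['marital'] = df_bank['marital'].fillna(df_bank['marital'].value_counts().idxmax())
df_bank['duration'] = df_bank['duration'].fillna(df_bank['duration'].mode()[0])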

# Test Time Series for Seasonality

In [None]:
# plot() creates its own figure, so no separate plt.figure() call is needed
fig = smt.seasonal_decompose(df_pass).plot()
fig.set_size_inches(10, 6)
plt.show()

# Interview Questions

## Ensemble Methods: used to improve algorithm performance and/or improve robustness
  - Bootstrapping: the central concept is random sampling with replacement. Each resample gives a model a chance to learn the various biases, variances and features within the data set, and the approach works even on small data sets. With the increase in processing power, it is much more practical to run these multiple models in parallel than ever before. Two ensemble methodologies are bagging and boosting.
  - Bagging: run multiple prediction models in parallel, each on its own bootstrap sample, and aggregate their output. Helps reduce variance in cases where overfitting is a concern. Aggregation can take the form of voting (classification) or averaging (regression).
  - Boosting: models are trained one after another and their outputs are weighted. Usually all samples start with equal weights on the first pass. The algorithm then keeps track of which samples are most frequently misclassified and gives them heavier weights, so later models concentrate on them. The model error rates are also tracked, and the better models are given more weight in the final prediction. Boosting can also reduce the bias in an underfit model.
<br>
- **Summary:** Both bagging and boosting combine many models to reduce error: bagging mainly decreases variance, while boosting mainly decreases bias. This is why many Kaggle winners use this type of approach.
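
As a minimal sketch of bootstrapping and bagging for regression, the example below uses only the standard library. The 1-nearest-neighbour base model, the function names, and the toy data are all assumptions chosen for illustration. In practice a library implementation such as scikit-learn's ensemble module would normally be used.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]

def predict_1nn(train, x):
    """High-variance base model: return the target of the nearest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def bagged_predict(train, x, n_models=50, seed=0):
    """Fit one base model per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    predictions = [predict_1nn(bootstrap_sample(train, rng), x)
                   for _ in range(n_models)]
    return statistics.mean(predictions)

# Toy (x, y) pairs, roughly y = 2x with some noise.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
print(bagged_predict(train, 2.5))
```

Averaging is the regression form of aggregation. For classification the last step would be a majority vote instead.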

## Sorting Algorithms
https://brilliant.org/wiki/sorting-algorithms/
![title](img/Sorting_algos.png) 

## Why Regularization?
[~] One way to reduce overfitting is to penalize model complexity, for example higher-degree polynomial terms. A higher-degree polynomial is then selected only if it reduces the error enough, compared to a simpler model, to overcome the penalty.  

[~] Occam's razor (or Ockham's razor) is a principle from philosophy. Suppose there exist two explanations for an occurrence. In this case the simpler one is usually better. Put another way, the more assumptions an explanation requires, the less likely it is to be correct.
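
One common way to apply such a penalty is L2 (ridge) regularization, which penalizes large coefficients rather than the polynomial degree directly. The sketch below assumes the closed-form ridge solution and synthetic data. `ridge_fit` is a hypothetical helper, and for brevity it penalizes the intercept too, which real implementations usually avoid.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data: a noisy sine wave sampled at 12 points.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Degree-7 polynomial features: columns 1, x, x^2, ..., x^7.
X = np.vander(x, N=8, increasing=True)

# A larger penalty shrinks the coefficient vector toward zero.
for lam in (1e-4, 1e-2, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w||={np.linalg.norm(w):.2f}")
```

As the penalty grows, the fitted curve becomes smoother: the high-degree terms are still present, but they contribute only if they reduce the error enough to justify their coefficients.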

## Coefficient of Variation (like a k-factor)
In probability theory and statistics, the coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage, and is defined as the ratio of the standard deviation to the mean. It is also used in finance to measure the volatility of a security.

Advantages
The coefficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data. In contrast, the actual value of the CV is independent of the unit in which the measurement has been taken, so it is a dimensionless number. For comparison between data sets with different units or widely different means, one should use the coefficient of variation instead of the standard deviation.

Disadvantages
When the mean value is close to zero, the coefficient of variation will approach infinity and is therefore sensitive to small changes in the mean. This is often the case if the values do not originate from a ratio scale.
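
A small sketch of the definition, using the sample standard deviation (the population version is an equally valid choice). The function name and the height data are made up for illustration. It shows that the CV is unchanged when the same measurements are expressed in different units.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean

# The same hypothetical heights in centimetres and in metres.
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_m = [h / 100 for h in heights_cm]

print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))
```

The standard deviations differ by a factor of 100 between the two lists, but the ratios agree, which is what makes the CV usable for comparing data sets with different units.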

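## Non-Convex Optimization Methods

Non-convex optimization is NP-hard in general. The main concerns are multiple saddle points and local minima, where a gradient-based method can stall. Stochastic Gradient Descent (SGD) is a common choice in practice, since the noise in its updates can help it move past saddle points and shallow local minima.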
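
The local-minimum problem can be seen on a one-dimensional example. The sketch below uses plain (deterministic) gradient descent rather than SGD to keep it short. The function `f` and the step size are arbitrary choices for illustration.

```python
def f(x):
    """A non-convex function with two minima of different depth."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, learning_rate=0.01, n_steps=2000):
    """Follow the negative gradient from x0 for a fixed number of steps."""
    x = x0
    for _ in range(n_steps):
        x -= learning_rate * grad_f(x)
    return x

# The minimum that is found depends on where the search starts.
for start in (-2.0, 2.0):
    x_min = gradient_descent(start)
    print(f"start={start:+.1f} -> x={x_min:.4f}, f(x)={f(x_min):.4f}")
```

The run that starts on the left settles near x ≈ -1.30, the deeper minimum, while the run that starts on the right stops near x ≈ 1.13, a shallower local minimum. Random restarts or noisy updates are common ways to reduce this dependence on the starting point.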

## Recommendation Engine
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. It first captures the past behavior of a customer and based on that, recommends products which the users might be likely to buy.

- We can recommend items to a user which are most popular among all the users
- We can divide the users into multiple segments based on their preferences (user features) and recommend items to them based on the segment they belong to

- Content filtering (e.g. similarity of terms, measured by cosine distance)
- Collaborative filtering
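
Content filtering by term similarity can be sketched with bag-of-words vectors and cosine similarity. The catalog, the function names, and the whitespace tokenization are all assumptions for illustration. A real system would typically use TF-IDF weights and a proper tokenizer.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(liked_title, catalog):
    """Rank the other catalog items by term similarity to the liked item."""
    vectors = {title: Counter(text.lower().split())
               for title, text in catalog.items()}
    target = vectors[liked_title]
    scores = {title: cosine_similarity(target, vec)
              for title, vec in vectors.items() if title != liked_title}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical item descriptions.
catalog = {
    "Python Basics": "python programming introduction",
    "Advanced Python": "python programming advanced topics",
    "Gardening 101": "soil plants watering introduction",
}
print(recommend("Python Basics", catalog))
# → ['Advanced Python', 'Gardening 101']
```

Collaborative filtering would replace the term vectors with vectors of user ratings, so that items are similar when the same users liked them rather than when their descriptions overlap.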