<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>pipeline.html</title>
</head>
<body>
<p>View in JupyterLab with <em>Open With → Markdown Preview</em>.</p>
<h1 id="machine-learning-plan">Machine Learning Plan</h1>
<p><br /><span class="math display"><em>y</em> = <em>m</em><em>x</em> + <em>b</em></span><br /></p>
<h2 id="classes">Classes</h2>
<ul>
<li>Introduction to Machine Learning for Coders, fast.ai [DONE]</li>
<li>Deep Learning, fast.ai [in progress]</li>
<li>Computational Linear Algebra, fast.ai [TBD]</li>
<li><a href="https://mlcourse.ai/">mlcourse.ai</a> [Schedule to start Sept 9, 2019]</li>
</ul>
<h2 id="reading">Reading</h2>
<ul>
<li>Deep Learning Book, <a href="http://www.deeplearningbook.org/">math prerequisites</a>. [TBD]</li>
<li>Feature Engineering Talk, <a href="https://www.slideshare.net/HJvanVeen/feature-engineering-72376750">slides link</a>. [TBD]</li>
</ul>
<h2 id="lectures">Lectures</h2>
<ul>
<li><a href="">3blue1brown Linear Algebra</a>.</li>
<li><a href="https://www.kaggle.com/owenzhang1">Owen Zhang</a> talk at NYC Data Academy (<a href="https://www.youtube.com/watch?v=LgLcfZjNF44">link</a>). Key ideas on model stacking (using glm on sparse and then feeding into xgb); using leave-one-out target encoding for high cardinality categorical variables; gbm tuning.</li>
<li><a href="">raddar</a> My Journey to Kaggle Grandmasster, Kaggle Days talk <a href="https://www.youtube.com/watch?v=7XEMPU17-Wo">link</a>.</li>
<li><a href="">CPMP</a> Beyond Feature Engineering and HPO, Kaggle Days talk <a href="https://www.youtube.com/watch?v=fH_FiquKhiI">link</a>.</li>
<li>Vincent W. Winning with Linear Models <a href="https://www.youtube.com/watch?v=68ABAU_V8qI">link</a>.</li>
<li>Vincent W. The Duct Tape of Heroes (Bayesian stats; pomegranate) <a href="https://www.youtube.com/watch?v=dE5j6NW-Kzg">link</a>.</li>
<li>Szilard, <span class="citation">@datascienceLA</span>, On Machine Learning Software <a href="http://videolectures.net/kdd2017_pafka_machine_learning_software/">link</a>.</li>
<li>Szilard, <span class="citation">@datascienceLA</span>, Better than Deep Learning: GBM <a href="https://www.youtube.com/watch?v=9GCEVv94udY">link</a></li>
<li>Tianqi Chen, XGBoost: A Scalable Tree Boosting System, June 2016 talk at DataScienceLA <a href="https://www.youtube.com/watch?v=9GCEVv94udY">link</a></li>
</ul>
<h1 id="machine-learning-pipeline">Machine Learning Pipeline</h1>
<h2 id="eda">EDA</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> pandas_profiling <span class="im">import</span> ProfileReport</code></pre></div>
<h2 id="data-cleaning">Data Cleaning</h2>
<ul>
<li><code>pyjanitor</code> (<a href="https://github.com/ericmjl/pyjanitor">https://github.com/ericmjl/pyjanitor</a>); see the sketch after this list.</li>
</ul>
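<p>A hedged sketch of the method-chaining style <code>pyjanitor</code> adds to pandas; the toy frame and column names are made up for illustration.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import pandas as pd
import janitor  # importing registers the cleaning methods on DataFrame

raw = pd.DataFrame({"First Name": ["alice", None], "Amount ($)": [1.0, 2.0]})
clean = (
    raw
    .clean_names()   # lower-case, snake_case column names
    .dropna()        # plain pandas methods chain alongside the janitor ones
)</code></pre></div>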
<h2 id="baseline-random-model">Baseline Random Model</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> sklearn.dummy <span class="im">import</span> DummyRegressor
<span class="im">from</span> sklearn.metrics <span class="im">import</span> make_scorer
scorer <span class="op">=</span> make_scorer(mean_squared_error)
scores_dummy <span class="op">=</span> cross_val_score(baseline, train_df.values, y, cv<span class="op">=</span>RepeatedKFold(n_repeats<span class="op">=</span><span class="dv">100</span>), scoring<span class="op">=</span>scorer)</code></pre></div>
<h2 id="feature-importance-and-explanability">Feature Importance and Explanability</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> eli5 <span class="im">import</span> show_weights
<span class="im">from</span> eli5.sklearn <span class="im">import</span> PermutationImportance
perm <span class="op">=</span> PermutationImportance(model, random_state<span class="op">=</span><span class="dv">1</span>).fit(X_train, y_train)
show_weights(perm, top<span class="op">=</span><span class="dv">50</span>, feature_names<span class="op">=</span>top_columns)</code></pre></div>
<ul>
<li><code>treeinterpreter</code></li>
<li><code>shap</code></li>
<li>Jeremy's dendrogram code to inspect for redundant features</li>
<li>Jeremy's RF code to see if a feature can predict whether a sample is in the train or the test set. If it can, the train and test distributions differ on that feature (e.g. time-dependent drift), so it is a candidate for removal; a minimal sketch follows this list.</li>
</ul>
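<p>A minimal sketch of the train-vs-test check above (often called adversarial validation); <code>train_df</code> and <code>test_df</code> are assumed to be numeric feature frames with matching columns.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Label each row by which set it came from, then try to tell the sets apart.
combined = pd.concat([train_df, test_df], axis=0)
is_test = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
auc = cross_val_score(clf, combined, is_test, cv=5, scoring="roc_auc").mean()
# AUC near 0.5: train and test look alike; AUC near 1.0: some features
# give away set membership and are candidates for dropping.</code></pre></div>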
<h2 id="feature-engineering-and-encoding">Feature Engineering and Encoding</h2>
<ul>
<li>Category Encoding: <a href="http://contrib.scikit-learn.org/categorical-encoding/index.html">http://contrib.scikit-learn.org/categorical-encoding/index.html</a></li>
</ul>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> category_encoders <span class="im">as</span> ce
encoder <span class="op">=</span> ce.LeaveOneOutEncoder(cols<span class="op">=</span>[...])
encoder.fit(X, y)
X_cleaned <span class="op">=</span> encoder.transform(X_dirty)</code></pre></div>
<h2 id="imputing">Imputing</h2>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">from</span> fancyimpute <span class="im">import</span> KNN
X_filled_knn <span class="op">=</span> KNN(k<span class="op">=</span><span class="dv">3</span>).fit_transform(X_incomplete)</code></pre></div>
<h2 id="imbalanced-data-and-data-augmentation">Imbalanced Data and Data Augmentation</h2>
<ul>
<li><a href="https://imbalanced-learn.org/en/stable/index.html">https://imbalanced-learn.org/en/stable/index.html</a>; a SMOTE sketch follows this list.</li>
</ul>
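<p>A minimal oversampling sketch with <code>imbalanced-learn</code>; recent releases use <code>fit_resample</code>, older ones <code>fit_sample</code>. <code>X</code> and <code>y</code> are the original imbalanced training data.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples until the classes are balanced.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)</code></pre></div>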
<h2 id="scaling-and-outlier-handling">Scaling and Outlier Handling</h2>
<h2 id="model-stacking">Model Stacking</h2>
<ul>
<li><code>vecstack</code> package, compatible with the <code>sklearn</code> API (<a href="https://github.com/vecxoz/vecstack">https://github.com/vecxoz/vecstack</a>); see the sketch after this list.</li>
</ul>
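<p>A minimal stacking sketch with <code>vecstack</code>, assuming <code>X_train</code>, <code>y_train</code> and <code>X_test</code> already exist; the level-1 models here are arbitrary choices.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">from vecstack import stacking
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

models = [RandomForestRegressor(n_estimators=100, random_state=0),
          GradientBoostingRegressor(random_state=0)]
# Out-of-fold predictions of the level-1 models become features for a level-2 model.
S_train, S_test = stacking(models, X_train, y_train, X_test,
                           regression=True, metric=mean_squared_error,
                           n_folds=4, shuffle=True, random_state=0, verbose=0)
meta = Ridge().fit(S_train, y_train)
final_pred = meta.predict(S_test)</code></pre></div>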
<h2 id="hyperparameter-optimzation">Hyperparameter Optimzation</h2>
<ul>
<li><code>BayesOptCV</code>; a hedged sketch follows this list.</li>
</ul>
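<p>A hedged sketch of Bayesian hyperparameter search, shown with scikit-optimize's <code>BayesSearchCV</code> as one concrete option; the estimator and search space are illustrative, not part of the original plan.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import GradientBoostingRegressor

# Sequentially proposes hyperparameter settings from the search space below.
search = BayesSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": Integer(100, 1000),
     "learning_rate": Real(0.01, 0.3, prior="log-uniform"),
     "max_depth": Integer(2, 8)},
    n_iter=32, cv=5, scoring="neg_mean_squared_error", random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)</code></pre></div>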
<h2 id="blending">Blending</h2>
<h2 id="using-neptune">Using Neptune</h2>
<h1 id="automl">AutoML</h1>
<ul>
<li>Kaggle Days SF: H2O AutoML places 8th in the hackathon (<a href="https://gist.github.com/ledell/4d4cd24b6a993a47069c511ba86b05bd">link</a>); see the sketch below.</li>
</ul>
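<p>A minimal H2O AutoML sketch along the lines of the linked gist; the file names and the <code>target</code> column are hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")             # hypothetical path
x = [c for c in train.columns if c != "target"]  # predictor columns
aml = H2OAutoML(max_models=20, max_runtime_secs=3600, seed=1)
aml.train(x=x, y="target", training_frame=train)
print(aml.leaderboard)
preds = aml.leader.predict(h2o.import_file("test.csv"))</code></pre></div>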
<h1 id="bayesian-learning">Bayesian Learning</h1>
<ul>
<li></li>
</ul>
</body>
</html>