# Project 3: Feature Selection + Classification

### Domain and Data

You're working as a data scientist with a research firm. You're firm is bidding on a big project that will involve working with thousands or possibly tens of thousands of features. You know it will be impossible to use conventional feature selection techniques. You propose that a way to win the contract is to demonstrate a capacity to identify relevant features using machine learning. Your boss says, "Great idea. Write it up." You figure that working with a synthetic dataset such as [Madelon](https://archive.ics.uci.edu/ml/datasets/Madelon) is an excellent way to demonstrate your abilities. 

#### Requirement

This work must be done on AWS.

### Problem Statement

Your challenge here is to develop a series of models for two purposes:

1. for the purposes of identifying relevant features. 
2. for the purposes of generating predictions from the model. 

### Solution Statement

Your final product will consist of:

1. A prepared report
2. A series of Jupyter notebooks to be used to control your pipelines

### Tasks

#### Data Manipulation

You should do substantive work on at least six subsets of the data. 

- 3 sets of 10% of the data from the UCI Madelon set
- 3 sets of 10% of the data from the Madelon set made available by your instructors (20000 rows, 1001 columns)

##### Prepared Report

Your report should:

1. be a pdf
2. include EDA of each subset 
   - EDA needs may be different depending upon subset or your approach to a solution
3. present results from Step 1: Benchmarking
4. present results from Step 2: Identify Salient Features
5. present results from Step 3: Feature Importances
6. present results from Step 4: Build Model

##### Jupyter Notebook, EDA 

- perform EDA on each set as you see necessary

##### Jupyter Notebook, Step 1 - Benchmarking
- build pipeline to perform a naive fit for each of the base model classes:
	- logistic regression
	- decision tree
	- k nearest neighbors
	- support vector classifier
- in order to do this, you will need to set a high `C` value in order to perform minimal regularization, in the case of logistic regression and support vector classifier.  An example is 1E10

##### Jupyter Notebook, Step 2 - Identify Features
- Build feature selection pipelines using at least three different techniques
- **NOTE**: these pipelines are being used for feature selection not prediction

##### Jupyter Notebook, Step 3 - Feature Importance
- Use the results from step 2 to discuss feature importance in the dataset
- Considering these results, develop a strategy for building a final predictive model
- recommended approaches:
    - Use feature selection to reduce the dataset to a manageable size then use conventional methods
    - Use dimension reduction to reduce the dataset to a manageable size then use conventional methods
    - Use an iterative model training method to use the entire dataset
   
##### Jupyter Notebook, Step 4 - Build Model
- Implement your final model
- (Optionally) use the entire data set

---

### Requirements

- Many Jupyter Notebooks
- A written report of your findings that detail the accuracy and assumptions of your model.

---

### Suggestions

- Document **everything**.



In [1]:
!conda install psycopg2 --yes
import psycopg2 as pg2
from psycopg2.extras import RealDictCursor
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
%matplotlib inline

Fetching package metadata ...........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda:
#
psycopg2                  2.7.3.1                  py36_0    conda-forge


In [2]:
!pip install tqdm



In [3]:
from tqdm import tqdm

In [42]:
def con_cur_to_class_db():
    con = pg2.connect(host='34.211.227.227',
                  dbname='postgres',
                  user='postgres')
    cur = con.cursor(cursor_factory=RealDictCursor)
    return con, cur

con, cur = con_cur_to_class_db()
cur.execute('SELECT * FROM madelon LIMIT 20;')
mad_loc = cur.fetchall()
con.close()

pd.DataFrame(mad_loc)

# """
# SELECT delicatessen, detergents_paper, fresh, frozen, grocery, milk 
#   FROM customer
#   ORDER BY random()
#   LIMIT 44;
# """

#   order of how fast jupyter returns slowest to fastest (using %timeit and %%timeit, 
# 'Orderby random()', TABLESAMPLE Bernulli(percentage), TABLESAMPLE System(percentage)

Unnamed: 0,_id,feat_000,feat_001,feat_002,feat_003,feat_004,feat_005,feat_006,feat_007,feat_008,...,feat_991,feat_992,feat_993,feat_994,feat_995,feat_996,feat_997,feat_998,feat_999,target
0,65008,0.233804,0.021645,-0.005741,0.053653,0.568006,1.012115,0.640655,0.068871,0.836866,...,0.282311,0.933816,0.08638,0.525392,0.090488,-0.213955,-0.641252,-0.467539,1.54284,1
1,65009,1.269663,0.208082,0.797997,-1.178071,0.34659,1.065299,-1.105136,0.384606,-0.594521,...,0.881328,2.506352,-0.93592,0.153322,0.644222,1.890829,-0.926437,0.547594,-0.097117,1
2,65010,-1.352447,0.410641,-0.62366,0.765278,0.304788,0.423155,-1.243989,1.591946,-0.015261,...,-0.9364,-0.894419,-0.549392,-0.220249,-1.708634,-1.264171,-1.067263,0.177068,-0.664339,1
3,65011,-1.095287,-0.415433,0.734779,0.585515,0.798698,0.888102,-1.082829,-0.526273,0.900528,...,-0.891968,2.695504,1.168979,0.278184,0.357588,-1.124858,2.313678,-0.386337,0.337246,0
4,65012,-1.480449,-1.262458,-0.082684,0.342265,0.266731,-1.538401,-0.284153,0.125305,-0.713936,...,-0.539757,-1.167354,-0.222483,0.114279,0.781058,-0.426658,-0.06503,-0.948853,-0.334696,0
5,65013,-1.189615,0.020583,0.814352,1.050721,-0.09127,-0.795187,0.824699,0.963618,-0.867851,...,-0.668746,1.286032,0.645456,1.172072,0.097918,0.262785,-1.324713,1.358479,0.209353,0
6,65014,0.839902,1.900529,-1.893551,-0.113232,-0.430092,-0.94859,2.48657,0.863517,-0.493195,...,-0.339106,-0.562696,0.947945,0.722753,-0.976936,0.123115,0.103122,-0.315806,-0.280173,1
7,65015,-0.06735,1.453597,0.118043,0.555425,-0.273903,0.1547,0.208757,1.140211,-0.396575,...,0.820969,0.342802,0.722467,1.730188,-0.538576,1.332398,-0.373137,0.329661,-1.697601,1
8,65016,0.598659,0.131134,-1.448382,-0.590257,0.762584,-1.911921,-0.094632,0.449598,1.771757,...,-1.178852,-0.281247,-0.844696,-1.04981,-0.592285,0.720303,-1.448775,0.127883,-0.933267,0
9,65017,0.565466,-0.711125,-1.233483,-2.444587,1.075742,-1.355291,-0.450825,0.04428,0.063351,...,-0.861695,-0.593333,0.692867,0.199004,-1.141359,-1.651427,-1.903584,-2.15325,-1.257915,0


In [None]:
# df.to_csv('example.csv')

In [None]:
for i in range(500):
    print(i, mean_r2_for_feature(etc, etc))
    
%timeit

In [None]:
pd.DataFrame(results)

In [None]:
# sns.pairplot(sample_10pct_1, kind='reg')

In [31]:
# local files

def import_and_labels(data, labels):
    
    data = data
    labels = labels
    madelon_train = pd.read_csv(data, delimiter=' ', header=None)
    madelon_labels = pd.read_csv(labels, delimiter=' ', header=None)
    madelon_train[500] = madelon_labels

    return madelon_train


In [32]:
madelon_train = import_and_header('madelon_train.data.csv','madelon_train.labels.csv')

In [None]:
madelon_train

In [33]:
np.random.seed(42)
msample1 = madelon_train.sample(200)
msample2 = madelon_train.sample(200)
msample3 = madelon_train.sample(200)

In [None]:
display(madelon_train.describe())
display(msample1.describe())
display(msample2.describe())
display(msample3.describe())

In [None]:
display(abs((msample1.mean() - madelon_train.mean())/madelon_train.std()).mean())
display(abs((msample2.mean() - madelon_train.mean())/madelon_train.std()).mean())
display(abs((msample3.mean() - madelon_train.mean())/madelon_train.std()).mean())


In [None]:
display((abs((msample1.mean() - madelon_train.mean())/madelon_train.std()) > 0.1).value_counts())
display((abs((msample2.mean() - madelon_train.mean())/madelon_train.std()) > 0.1).value_counts())
display((abs((msample3.mean() - madelon_train.mean())/madelon_train.std()) > 0.1).value_counts())

# very few of these are more than 0.1 std above the mean, and none are more than 0.2 above the mean.  that's pretty good!

In [None]:
# do unsupervised learning on X.  compare a column in X to anything else to look for noise

from sklearn.tree import DecisionTreeRegressor
train_test_split

then r2_for_feature

# so basically the work flow is like 1) R2 to find the non-noisy features, 
# 2) PCA to find the significant features, 
# 3) grid search so u can laugh hysterically as ur score rises by .03%?

In [34]:
def calculate_r_2_for_feature(data,feature,regression_method):
    new_data = data.drop(feature, axis=1)

    X_train, X_test, y_train, y_test = train_test_split(new_data,data[feature],test_size=0.25)

    regressor = regression_method
    regressor.fit(X_train,y_train)

    score = regressor.score(X_test,y_test)
    if score > 0:
        return feature, score

def mean_r2_for_feature(data, feature):
    scores = []
    for _ in range(10):
        scores.append(calculate_r_2_for_feature(data, feature))
        
    scores = np.array(scores)
    return scores.mean()



In [8]:
sample1_features = []
sample2_features = []
sample3_features = []

for i in range(501):
    feature1 = mean_r2_for_feature(msample1, i)
    sample1_features.append(feature1)
    print(i, feature1)
for i in range(501):
    feature2 = mean_r2_for_feature(msample2, i)
    sample2_features.append(feature2)
    print(i, feature2)
for i in range(501):
    feature3 = mean_r2_for_feature(msample3, i)
    sample3_features.append(feature3)
    print(i, feature3)

0 -0.24768461002
1 -0.195852611852
2 -0.292481209294
3 -0.200510318964
4 -0.33975191563
5 -0.224609629793
6 -0.125333296143
7 -0.297948615225
8 -0.210867319502
9 -0.314242548217
10 -0.154004100201
11 -0.276678946447
12 -0.29771735403
13 -0.239761741448
14 -0.277118794306
15 -0.18162346575
16 -0.359404073135
17 -0.237282801215
18 -0.168848506964
19 -0.229445540586
20 -0.237075198406
21 -0.226218543289
22 -0.317015431413
23 -0.0890635518779
24 -0.141329864962
25 -0.137634175123
26 -0.181238990838
27 -0.263095559139
28 0.509665258716
29 -0.224670349113
30 -0.230736620788
31 -0.231450241457
32 -0.206985547106
33 -0.257728978448
34 -0.279603576946
35 -0.162512533689
36 -0.228606851645
37 -0.164899086945
38 -0.187928989885
39 -0.186456442985
40 -0.140584584974
41 -0.234002872554
42 -0.212243007293
43 -0.239229491834
44 -0.307971395587
45 -0.202669929596
46 -0.259887114619
47 -0.247293932389
48 0.301732778454
49 -0.135449446228
50 -0.158612615783
51 -0.163646693615
52 -0.313826473573
53 -0.29

418 -0.23722495616
419 -0.28299533151
420 -0.141002558532
421 -0.184420993868
422 -0.0950561106433
423 -0.193915863447
424 -0.183034242968
425 -0.275767883482
426 -0.239664555487
427 -0.260939667818
428 -0.238456388312
429 -0.216069382287
430 -0.289552135284
431 -0.160732700978
432 -0.262134889078
433 0.794806469337
434 -0.215662062336
435 -0.225524449801
436 -0.168905877177
437 -0.31709447699
438 -0.296083431416
439 -0.199436571155
440 -0.285257366794
441 -0.173038816027
442 0.630317997249
443 -0.34907517609
444 -0.295147533008
445 -0.331653939718
446 -0.290331678679
447 -0.129263416108
448 -0.321245341071
449 -0.220262738276
450 -0.268382343838
451 0.493347222987
452 -0.294563456496
453 0.816442582923
454 -0.281724900352
455 0.817743526954
456 -0.241057342408
457 -0.187244359316
458 -0.293786309084
459 -0.228306739669
460 -0.338888097366
461 -0.232582769254
462 -0.319441790261
463 -0.21610634235
464 -0.175127906409
465 -0.267449017333
466 -0.208318663871
467 -0.121642584896
468 -0.21

336 0.769846188844
337 -0.246665036138
338 0.83433705226
339 -0.100723956019
340 -0.421359356885
341 -0.23378022924
342 -0.157940923575
343 -0.279354825886
344 -0.27063163279
345 -0.176350329986
346 -0.211114287519
347 -0.255393790323
348 -0.25407485129
349 -0.249386628273
350 -0.237570438465
351 -0.274430866074
352 -0.321697717315
353 -0.136691710642
354 -0.150355533845
355 -0.251714880846
356 -0.208689124649
357 -0.12523759834
358 -0.330220459862
359 -0.263905041626
360 -0.253303711805
361 -0.153682879302
362 -0.167146604847
363 -0.23372983794
364 -0.178082837472
365 -0.280514578745
366 -0.263026150556
367 -0.123027269416
368 -0.21631491826
369 -0.218736774257
370 -0.154289138954
371 -0.304133630409
372 -0.22059103615
373 -0.387800938992
374 -0.408189844322
375 -0.222505617134
376 -0.17680326167
377 -0.221154029912
378 0.191567787166
379 -0.283134342287
380 -0.150608485663
381 -0.325802080703
382 -0.178437114428
383 -0.213589092874
384 -0.283496523885
385 -0.183879329117
386 -0.28971

254 -0.219808578283
255 -0.269696744542
256 -0.236634621071
257 -0.256580127305
258 -0.264965042674
259 -0.235097996403
260 -0.172936861873
261 -0.262067044238
262 -0.227275956941
263 -0.275516835706
264 -0.332526620427
265 -0.25335207821
266 -0.26985955854
267 -0.293388974189
268 -0.187529207785
269 -0.222613976989
270 -0.191481702495
271 -0.21516416902
272 -0.243716157759
273 -0.0877056779929
274 -0.22380856952
275 -0.202563723742
276 -0.304806850508
277 -0.224493341343
278 -0.211606059998
279 -0.266881131333
280 -0.25066482806
281 0.814583961021
282 -0.222861134767
283 -0.212668850437
284 -0.121907427567
285 -0.298036614507
286 -0.254171102277
287 -0.233626426711
288 -0.262607716741
289 -0.28811731723
290 -0.175756765221
291 -0.320175872019
292 -0.257735840396
293 -0.229956575893
294 -0.386994637515
295 -0.304728008551
296 -0.262462060422
297 -0.221176012887
298 -0.226968521399
299 -0.295089791042
300 -0.307266078891
301 -0.237212304869
302 -0.120412914664
303 -0.328313340134
304 -0

In [9]:
sample1_features = np.array(sample1_features)
sample2_features = np.array(sample3_features)
sample3_features = np.array(sample3_features)

In [20]:
sample1_features = pd.DataFrame(sample1_features)
sample2_features = pd.DataFrame(sample3_features)
sample3_features = pd.DataFrame(sample3_features)

In [28]:
display(sample1_features[sample1_features[0] > 0])
display(sample2_features[sample2_features[0] > 0])
display(sample3_features[sample3_features[0] > 0])

sample1_features[sample1_features[0] > 0].shape, sample1_features[sample1_features[0] > 0].shape, sample1_features[sample1_features[0] > 0].shape

Unnamed: 0,0
28,0.509665
48,0.301733
64,0.806657
105,0.711094
128,0.824249
153,0.74611
241,0.722647
281,0.819752
318,0.430927
336,0.774236


Unnamed: 0,0
28,0.478347
48,0.300182
64,0.771462
105,0.603788
128,0.795555
153,0.722771
241,0.749475
281,0.814584
318,0.362702
328,0.016767


Unnamed: 0,0
28,0.478347
48,0.300182
64,0.771462
105,0.603788
128,0.795555
153,0.722771
241,0.749475
281,0.814584
318,0.362702
328,0.016767


((20, 1), (20, 1), (20, 1))

In [35]:
sample1_features_dtree = []
sample2_features_dtree = []
sample3_features_dtree = []

for i in range(501):
    feature1 = mean_r2_for_feature(msample1, i)
    sample1_features_dtree.append(feature1)
    print(i, feature1)
for i in range(501):
    feature2 = mean_r2_for_feature(msample2, i)
    sample2_features_dtree.append(feature2)
    print(i, feature2)
for i in range(501):
    feature3 = mean_r2_for_feature(msample3, i)
    sample3_features_dtree.append(feature3)
    print(i, feature3)

0 -1.1117568174
1 -1.19831052889
2 -1.49418350927
3 -0.941166018444
4 -1.7563047078
5 -0.992529763062
6 -1.52206006552
7 -0.940518462483
8 -1.18620700489
9 -1.11298173196
10 -1.14380018202
11 -1.43494327037
12 -1.26076776447
13 -1.27039123545
14 -1.29757502661
15 -1.45359214728
16 -1.26379672148
17 -0.910025934459
18 -0.85397177139
19 -1.21998555318
20 -1.2903182762
21 -1.33120508303
22 -1.02940147089
23 -1.02733072651
24 -1.23759661791
25 -1.21305879331
26 -0.898958673429
27 -1.0749822517
28 0.954347241015
29 -1.27432024897
30 -0.914568061577
31 -1.09877400619
32 -1.06646194763
33 -1.40073720673
34 -1.0129972598
35 -1.17998094201
36 -1.40799406382
37 -1.12600867702
38 -1.48806024133
39 -1.40153937687
40 -1.16068056187
41 -1.00759964628
42 -1.40412638343
43 -1.12499716089
44 -0.870647717733
45 -1.0912101408
46 -1.05566326539
47 -1.32341303443
48 0.930032180711
49 -1.07574490841
50 -1.3159505432
51 -1.06264634216
52 -1.16614773126
53 -1.43482639004
54 -1.35633999703
55 -1.15699366816
56

436 -0.920665829479
437 -0.979657592711
438 -1.35309875516
439 -1.25121370188
440 -1.12738579342
441 -1.21339270308
442 0.896145006225
443 -1.00378853369
444 -1.09371414871
445 -1.01159101021
446 -1.08832887416
447 -1.2033272503
448 -0.998860525312
449 -1.1029968215
450 -1.50020988208
451 0.925716902752
452 -1.27558141329
453 0.887341950404
454 -0.986915401158
455 0.392726662295
456 -1.32964137794
457 -1.23307850838
458 -1.2540111608
459 -1.00631811152
460 -1.18949281018
461 -1.22229678761
462 -1.23654718578
463 -1.10196769982
464 -1.10540148113
465 -1.18280057029
466 -1.01210575638
467 -1.11537788275
468 -1.0715399406
469 -0.854342816233
470 -0.916366554511
471 -1.10254016001
472 0.915929925758
473 -1.02247629848
474 -1.35363386177
475 0.927443633394
476 -1.0310472209
477 -1.14961026933
478 -1.09608589102
479 -1.30522561928
480 -1.11347279225
481 -1.22418656612
482 -1.23344003903
483 -1.12984742605
484 -1.27208618207
485 -1.01143530722
486 -1.06091270323
487 -0.877597708286
488 -1.145

370 -1.07935697851
371 -1.47472813583
372 -1.07753287257
373 -1.00931058408
374 -0.907917968965
375 -1.56618476112
376 -1.14525402161
377 -0.987693915751
378 0.922402892753
379 -1.04149467079
380 -1.25907685808
381 -1.03349633843
382 -1.17723675594
383 -1.10296321339
384 -1.45576667244
385 -1.03842168939
386 -1.10940330938
387 -1.31785178996
388 -1.1252599408
389 -1.04420677511
390 -0.853182663761
391 -1.15988907222
392 -1.17093057993
393 -0.806524058161
394 -1.20359995763
395 -0.999339664969
396 -1.39511713413
397 -1.1104272264
398 -1.00073660368
399 -1.06392733979
400 -1.33927957877
401 -1.15201339738
402 -1.45875559224
403 -1.159252771
404 -1.04291498334
405 -1.05068983802
406 -1.19518599153
407 -1.1255600417
408 -1.23960381539
409 -0.868428057917
410 -1.34754089346
411 -1.27007122286
412 -1.08355981613
413 -1.05764167681
414 -0.829438810384
415 -0.996719812836
416 -1.4085999194
417 -1.16083193376
418 -1.1091081391
419 -1.24300053279
420 -1.21732367216
421 -1.15029128078
422 -0.8893

305 -1.22208875437
306 -1.18537124866
307 -0.984293728325
308 -1.22915511971
309 -1.15751415519
310 -0.916985659869
311 -1.15453619881
312 -1.27487993374
313 -1.17170308935
314 -0.922984418513
315 -1.15980515106
316 -1.00174720188
317 -0.978961762311
318 0.920030684972
319 -1.0266964556
320 -1.39134223484
321 -1.15457931955
322 -1.1968865705
323 -1.16775352592
324 -1.20262389572
325 -1.18332438552
326 -0.842714444507
327 -1.01414922978
328 -1.17388932101
329 -1.07439053409
330 -1.25864474482
331 -1.13191790365
332 -0.981308540369
333 -1.05341055435
334 -1.11103762964
335 -1.18189960665
336 0.913104612394
337 -1.07895740787
338 0.45864387692
339 -1.30221152483
340 -1.2512924179
341 -1.11127900648
342 -1.16652874522
343 -1.13179937873
344 -1.07579784163
345 -1.05522447228
346 -0.924581003339
347 -1.00205984634
348 -1.13562310752
349 -1.18064205292
350 -1.38016977583
351 -1.01775588528
352 -1.04333760448
353 -1.05195470846
354 -1.04383971597
355 -1.05033038624
356 -1.09453542316
357 -1.53

In [36]:
sample1_features_dtree = pd.DataFrame(sample1_features_dtree)
sample2_features_dtree = pd.DataFrame(sample3_features_dtree)
sample3_features_dtree = pd.DataFrame(sample3_features_dtree)

display(sample1_features_dtree[sample1_features_dtree[0] > 0])
display(sample2_features_dtree[sample2_features_dtree[0] > 0])
display(sample3_features_dtree[sample3_features_dtree[0] > 0])

Unnamed: 0,0
28,0.954347
48,0.930032
64,0.941577
105,0.910646
128,0.927444
153,0.934918
241,0.924707
281,0.946261
318,0.948153
336,0.920847


Unnamed: 0,0
28,0.930275
48,0.940344
64,0.923788
105,0.914993
128,0.929632
153,0.943705
241,0.931504
281,0.935466
318,0.920031
336,0.913105


Unnamed: 0,0
28,0.930275
48,0.940344
64,0.923788
105,0.914993
128,0.929632
153,0.943705
241,0.931504
281,0.935466
318,0.920031
336,0.913105


In [40]:
informative_features = [493,475,472,455,453,451,442,433,378,338,336,318,281,241,153,128,105,64,48,28,328]
informative_features = sort(informative_features)

NameError: name 'sort' is not defined

In [38]:
len(informative_features)

21

In [None]:
# corr = msample1.corr()
# mask = np.zeros_like(corr)
# mask[np.triu_indices_from(mask, 1)] = True
# with sns.axes_style("white"):
#     ax = sns.heatmap(corr, mask=mask, square=True, annot=True,
#                      cmap='RdBu', fmt='+.3f')
#     plt.xticks(rotation=45, ha='center')

# def rename_columns(df):
#     for i in range(len(df.columns)):
#         df.columns.values[i] = i
#     return df

# DBsample1_1 = rename_columns(DBsample1_1)
# DBsample2_1 = rename_columns(DBsample2_1)
# DBsample3_1 = rename_columns(DBsample3_1)

In [41]:
msample1[informative_features]

Unnamed: 0,493,475,472,455,453,451,442,433,378,338,...,318,281,241,153,128,105,64,48,28,328
1860,426,530,399,571,424,489,327,539,539,397,...,533,517,526,566,483,560,531,513,498,504
353,435,540,511,455,423,474,550,396,564,532,...,467,412,533,370,485,590,405,513,474,482
1333,594,469,422,555,571,484,358,622,582,384,...,517,561,486,664,486,572,586,542,492,487
905,617,521,452,413,597,483,429,616,414,334,...,514,557,518,659,487,605,412,421,489,483
1289,643,374,494,461,620,485,524,530,487,508,...,523,501,426,569,483,530,557,479,491,488
1273,598,594,508,354,583,470,555,526,455,386,...,445,498,571,540,490,611,312,452,467,488
938,580,599,459,489,570,467,445,589,545,347,...,430,555,563,656,488,609,455,525,464,480
1731,433,507,529,516,420,461,619,430,527,633,...,412,421,523,398,471,407,485,511,457,487
65,335,479,481,654,350,466,499,464,541,656,...,444,454,494,456,463,325,609,517,463,492
1323,658,375,527,500,645,471,585,566,467,555,...,457,523,426,591,470,405,612,461,471,478


In [None]:
sns.pairplot(sample_10pct_1, kind='reg')

Apply what you have learned in this morning's lesson to Project 3.


## Notebook 1

Think about the goals of Project 3 in terms of **supervised learning** and **unsupervised learning**. Use this to start your report!

## Notebook 2

Write a function to connect to the database. 

Make sure to close your connections. This is a best practice, but remember the more connections left hanging, the less memory the server will have. You are all sharing this server!

In [None]:
class DataBaseCommunication(object):

    # Initializing object.
    def __init__(self, host, dbname, user):

        self.host = host
        self.dbname = dbname
        self.user = user
        self.connected = False

    def connect(self):
        # Create a connection to a server
        self.con = pg2.connect(host = self.host,
                               dbname = self.dbname,
                               user = self.user)

       # Using a RealDictCursor as a means to do things over the
        # connection.
        self.cur = self.con.cursor(cursor_factory=RealDictCursor)

        self.connected = True

        return("Database connected")

    def execute(self, command_str, pdDataFrame = False):
        # Check if our class has connected, if not assert.
        assert self.connected, "Have not connected to the database!"

        self.cur.execute(command_str)
        results = self.cur.fetchall()

       # If res_pdDataFrame = True, return results a pandas dataframe.
        if (pdDataFrame):
            return(pd.DataFrame(results))
        else:
            return(results)

    def close(self):
        if (self.connected):

            self.con.close()
            self.connected = False          
            return("Database closed")

        else:
            return("Database not connected")

## Notebook 3

Start sampling the data.

Assess basic stats on the data on samples of the data and the whole dataset. Try to develop an intuition for how much data you need to make an observation about the data. 

## Notebook 4

Write a function to calculate the $R^2$ score for a dropped feature. There are 500 features for the smaller set. 5 of them are informative and another 15 are linear combinations of these 5. The other 480 are noise. Can you use this method to identify which features are informative? redundant? noise?

Use seaborn to visualize the dataset.

# Bonus

We did this in office hours the other day.

In [None]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
X, y = make_classification(100, 2, 1, 0, 0, 2, 1)
X[0,:].shape

In [None]:
plt.scatter(X[:,0], X[:,1], c=y)

In [None]:
X, y = make_classification(100, 2, 2, 0, 0, 2, 1)
plt.scatter(X[:,0], X[:,1], c=y)

In [None]:
X, y = make_classification(n_samples=100, 
                           n_features=3, 
                           n_informative=3, 
                           n_redundant=0, 
                           n_repeated=0, 
                           n_classes=2, 
                           n_clusters_per_class=3,
                           class_sep=4)

In [None]:
from mpl_toolkits.mplot3d import axes3d
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')
data_1 = X[:,0]
data_2 = X[:,1]
data_3 = X[:,2]
_ = ax.scatter(data_1, data_2, data_3, c=y)

# rotate the view by changing these values
ax.view_init(40, 30)