
# Lambda School Data Science - Logistic Regression

Logistic regression is a standard baseline for classification models, as well as a handy way to predict probabilities (its outputs, like probabilities, live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).
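
To make "where linear goes wrong" concrete before touching the real data, here is a minimal sketch on made-up data (not the Titanic set, which needs Kaggle credentials). The synthetic feature `x`, the labels `y`, and the query points `x_new` are all assumptions for illustration. The point is only that a linear model's predictions for a 0/1 outcome are not confined to the unit interval, while logistic regression's predicted probabilities are.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic stand-in data (an assumption for illustration, not Titanic):
# one feature, and a binary label that becomes more likely as x grows.
rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=500).reshape(-1, 1)
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(int)

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

# Ask both models about points inside and well outside the training range.
x_new = np.array([[-10.0], [0.0], [10.0]])
linear_pred = linear.predict(x_new)
logistic_prob = logistic.predict_proba(x_new)[:, 1]

print("linear:  ", linear_pred)    # not constrained to [0, 1]
print("logistic:", logistic_prob)  # always inside [0, 1]
```

The same comparison can be rerun on the Titanic `Survived` column once the download works.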

In [0]:
!pip install kaggle



In [0]:
import datetime

import numpy as np
import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [0]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
from google.colab import drive
drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
env: KAGGLE_CONFIG_DIR=/content/drive/My Drive/
401 - Unauthorized


In [0]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
  # np.exp is the idiomatic, vectorized way to compute e**(-x)
  return 1 / (1 + np.exp(-x))
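
A quick check of what the sigmoid does can help build intuition. The sketch below redefines the function so it runs on its own; the sample scores in `z` are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same logistic function as above, written with np.exp
    return 1 / (1 + np.exp(-x))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # arbitrary linear scores
p = sigmoid(z)
print(p)

# The output is strictly between 0 and 1, equals 0.5 at z = 0,
# and is symmetric: sigmoid(-z) == 1 - sigmoid(z).
assert np.all((p > 0) & (p < 1))
assert np.isclose(sigmoid(0.0), 0.5)
assert np.allclose(sigmoid(-z), 1 - p)

# Inverting the sigmoid recovers the linear score (the "log-odds" or logit).
log_odds = np.log(p / (1 - p))
print(log_odds)
```

This inverse relationship is why logistic regression coefficients are read as changes in log-odds.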

In [0]:
?LogisticRegression

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and use to classify the genre of each track. To get you started:

In [2]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip

--2019-02-26 14:27:38--  https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
Resolving os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)... 86.119.28.13, 2001:620:5ca1:2ff::ce53
Connecting to os.unil.cloud.switch.ch (os.unil.cloud.switch.ch)|86.119.28.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358412441 (342M) [application/zip]
Saving to: ‘fma_metadata.zip’


2019-02-26 14:28:10 (11.1 MB/s) - ‘fma_metadata.zip’ saved [358412441/358412441]

Archive:  fma_metadata.zip
 bunzipping: fma_metadata/README.txt  
 bunzipping: fma_metadata/checksums  
 bunzipping: fma_metadata/not_found.pickle  
 bunzipping: fma_metadata/raw_genres.csv  
 bunzipping: fma_metadata/raw_albums.csv  
 bunzipping: fma_metadata/raw_artists.csv  
 bunzipping: fma_metadata/raw_tracks.csv  
 bunzipping: fma_metadata/tracks.csv  
 bunzipping: fma_metadata/genres.csv  
 bunzipping: fma_metadata/raw_echonest.csv  
 bunzipping: fma_metadata/echonest.csv  
 bunzipping: fma_metadata/features.

In [3]:
!ls -lsa *

350020 -rw-r--r-- 1 root root 358412441 May  9  2017 fma_metadata.zip

fma_metadata:
total 1430544
     4 drwxr-xr-x 2 root root      4096 Feb 26 14:28 .
     4 drwxr-xr-x 1 root root      4096 Feb 26 14:28 ..
     4 -r--r--r-- 1 root root       563 Apr  1  2017 checksums
 42972 -r--r--r-- 1 root root  44000447 Apr  1  2017 echonest.csv
928832 -r--r--r-- 1 root root 951117185 Apr  1  2017 features.csv
     4 -r--r--r-- 1 root root      3922 Apr  1  2017 genres.csv
   212 -r--r--r-- 1 root root    216942 Apr  1  2017 not_found.pickle
 23580 -r--r--r-- 1 root root  24144296 Apr  1  2017 raw_albums.csv
 13716 -r--r--r-- 1 root root  14044281 Apr  1  2017 raw_artists.csv
 47504 -r--r--r-- 1 root root  48642077 Apr  1  2017 raw_echonest.csv
     8 -r--r--r-- 1 root root      5866 Apr  1  2017 raw_genres.csv
119384 -r--r--r-- 1 root root 122246181 Apr  1  2017 raw_tracks.csv
     4 -r--r--r-- 1 root root       256 Apr  1  2017 README.txt
254316 -r--r--r-- 1 root root 260414445 Apr  1  2017 t

In [4]:
!head fma_metadata/tracks.csv

,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A 

In [5]:
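# skiprows=1 drops the top header row (the album/artist/set/track group labels);
# low_memory=False parses the file in one pass, which avoids the mixed-dtype warning
tracks = pd.read_csv('fma_metadata/tracks.csv', skiprows=1, low_memory=False)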


In [9]:
tracks.describe()

Unnamed: 0,comments,favorites,id,listens,tracks,comments.1,favorites.1,id.1,latitude,longitude,bit_rate,comments.2,duration,favorites.2,interest,listens.1,number
count,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,44544.0,44544.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0,106574.0
mean,0.394946,1.286927,12826.933914,32120.31,19.721452,1.894702,30.041915,12036.770404,39.901626,-38.668642,263274.695048,0.031621,277.8491,3.182521,3541.31,2329.353548,8.260945
std,2.268915,3.133035,6290.261805,147853.2,39.943673,6.297679,100.511408,6881.420867,18.24086,65.23722,67623.443584,0.321993,305.518553,13.51382,19017.43,8028.070647,15.243271
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-45.87876,-157.526855,-1.0,0.0,0.0,0.0,2.0,0.0,0.0
25%,0.0,0.0,7793.0,3361.0,7.0,0.0,1.0,6443.0,39.271398,-79.997459,192000.0,0.0,149.0,0.0,599.0,292.0,2.0
50%,0.0,0.0,13374.0,8982.0,11.0,0.0,5.0,12029.5,41.387917,-73.554431,299914.0,0.0,216.0,1.0,1314.0,764.0,5.0
75%,0.0,1.0,18203.0,23635.0,17.0,1.0,16.0,18011.0,48.85693,4.35171,320000.0,0.0,305.0,3.0,3059.0,2018.0,9.0
max,53.0,61.0,22940.0,3564243.0,652.0,79.0,963.0,24357.0,67.286005,175.277,448000.0,37.0,18350.0,1482.0,3293557.0,543252.0,255.0


In [0]:

pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head()

Unnamed: 0,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
1,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168.0,2.0,Hip-Hop,[21],[21],,4656.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293.0,,3.0,,[],Food
2,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000.0,0.0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237.0,1.0,Hip-Hop,[21],[21],,1470.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514.0,,4.0,,[],Electric Ave
3,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.058324,New Jersey,-74.405661,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206.0,6.0,Hip-Hop,[21],[21],,1933.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151.0,,6.0,,[],This World
4,0.0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4.0,6.0,,47632.0,,[],Constant Hitmaker,2.0,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3.0,2008-11-26 01:42:55,74.0,6.0,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000.0,0.0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161.0,178.0,Pop,[10],[10],,54881.0,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135.0,,1.0,,[],Freeway
5,0.0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2.0,4.0,"<p> ""spiritual songs"" from Nicky Cook</p>",2710.0,,[],Niris,13.0,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2.0,2008-11-26 01:42:52,10.0,4.0,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000.0,0.0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311.0,0.0,,"[76, 103]","[17, 10, 76, 103]",,978.0,en,Attribution-NonCommercial-NoDerivatives (aka M...,361.0,,3.0,,[],Spiritual Level


In [0]:
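# Run once after each read: the first data row is the leftover "track_id"
# header line, and "Unnamed: 0" is the column holding the track ids
tracks = tracks.drop(tracks.index[0])
del tracks['Unnamed: 0']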
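
The manual cleanup above works, but it leaves duplicated names such as `comments.1` and `title.1`. An alternative sketch is to let pandas read both header rows as a column MultiIndex and then flatten them into unique names. The tiny CSV below is made up to mirror the layout shown by `!head` (two header rows, then a `track_id` line); the names `tracks_mi` and `tracks_flat` are illustrative, and this approach was not used for the outputs in this notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for fma_metadata/tracks.csv (made-up rows),
# keeping its layout: two header rows, then a row naming the index.
csv_text = (
    ",album,album,track,track\n"
    ",comments,title,genre_top,title\n"
    "track_id,,,,\n"
    "2,0,AWOL - A Way Of Life,Hip-Hop,Food\n"
    "10,0,Constant Hitmaker,Pop,Freeway\n"
)

# header=[0, 1] reads both header rows as a column MultiIndex;
# index_col=0 uses the first column (track_id) as the row index.
tracks_mi = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Flatten ('album', 'title') -> 'album_title' so names are unique and readable.
tracks_flat = tracks_mi.copy()
tracks_flat.columns = ["_".join(col) for col in tracks_mi.columns]

print(tracks_flat.columns.tolist())
print(tracks_flat)
```

With names like `track_genre_top` it is also clearer which of the three `comments` or `favorites` columns refers to the album, the artist, or the track.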

In [8]:
tracks.shape

(106574, 52)

In [0]:
#genres = pd.read_csv('fma_metadata/genres.csv')

In [0]:
#genres

In [10]:
tracks['genre_top'].value_counts()

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64

In [11]:
tracks.isnull().sum().sort_values(ascending=False)

lyricist             106263
publisher            105311
information.1        104225
composer             102904
active_year_end      101199
wikipedia_page       100993
date_recorded        100415
related_projects      93422
associated_labels     92303
language_code         91550
engineer              91279
producer              88514
active_year_begin     83863
latitude              62030
longitude             62030
members               59725
genre_top             56976
location              36364
date_released         36280
bio                   35418
website               27318
information           23425
type                   6508
date_created           3529
title                  1025
date_created.1          856
license                  87
title.1                   1
tracks                    0
tags                      0
comments.1                0
listens                   0
id                        0
favorites                 0
name                      0
favorites.1         

In [0]:
df = tracks.dropna(subset=['genre_top'])

In [13]:
df.isnull().sum().sort_values(ascending=False)

lyricist             49498
publisher            49188
information.1        48294
composer             48245
active_year_end      47200
wikipedia_page       46864
date_recorded        45347
related_projects     44233
associated_labels    43455
engineer             40687
language_code        40215
active_year_begin    40016
producer             39372
latitude             28618
longitude            28618
members              27653
location             17488
bio                  16194
date_released        16086
website              12693
information          10154
type                  2047
date_created          1051
title                  309
date_created.1         215
license                 59
title.1                  1
tags                     0
tracks                   0
listens                  0
id                       0
favorites                0
comments.1               0
name                     0
favorites.1              0
id.1                     0
number                   0
l

In [0]:
#df.columns[df.isna().any()].tolist()

In [0]:
# Drop every column that still contains at least one missing value
df1 = df.dropna(axis=1)

In [17]:
df1.shape

(49598, 25)

In [18]:
df1.dtypes

comments          float64
favorites         float64
id                float64
listens           float64
tags               object
tracks            float64
comments.1        float64
favorites.1       float64
id.1              float64
name               object
tags.1             object
split              object
subset             object
bit_rate          float64
comments.2        float64
date_created.2     object
duration          float64
favorites.2       float64
genre_top          object
genres             object
genres_all         object
interest          float64
listens.1         float64
number            float64
tags.2             object
dtype: object

In [19]:
df1.genre_top.value_counts()

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64

In [21]:
df1.isnull().sum()

comments          0
favorites         0
id                0
listens           0
tags              0
tracks            0
comments.1        0
favorites.1       0
id.1              0
name              0
tags.1            0
split             0
subset            0
bit_rate          0
comments.2        0
date_created.2    0
duration          0
favorites.2       0
genre_top         0
genres            0
genres_all        0
interest          0
listens.1         0
number            0
tags.2            0
dtype: int64

In [0]:
df1.sample(11)

Unnamed: 0,comments,favorites,id,listens,tags,tracks,comments.1,favorites.1,id.1,name,tags.1,split,subset,bit_rate,comments.2,date_created.2,duration,favorites.2,genre_top,genres,genres_all,interest,listens.1,number,tags.2
45603,1.0,1.0,12001.0,26989.0,[],17.0,3.0,109.0,8021.0,Anitek,"['new music', 'electronica', 'experimental', '...",training,medium,189493.0,0.0,2012-06-25 14:34:16,122.0,3.0,Electronic,"[286, 495]","[495, 286, 15]",1922.0,1442.0,0.0,[]
101204,0.0,0.0,21982.0,20513.0,[],25.0,0.0,8.0,19740.0,The Zombie Dandies,['zombie dandies'],training,medium,320000.0,0.0,2016-12-10 07:05:22,240.0,3.0,Rock,"[12, 25, 111]","[25, 12, 111]",1282.0,954.0,2.0,[]
15027,0.0,2.0,5480.0,9216.0,[],0.0,2.0,7.0,6227.0,Golden Hits,['golden hits'],training,large,230766.0,0.0,2010-01-21 19:37:14,118.0,2.0,Experimental,[47],"[38, 47]",1965.0,1375.0,10.0,[]
22803,0.0,1.0,7320.0,5857.0,[],10.0,2.0,15.0,3222.0,Garmisch,['garmisch'],test,small,320000.0,0.0,2010-09-14 15:17:18,137.0,2.0,Pop,"[10, 362]","[10, 362]",816.0,572.0,6.0,[]
2382,0.0,1.0,1460.0,4507.0,['new orleans'],7.0,0.0,8.0,1165.0,Quintron,"['new orleans', 'quintron']",training,large,256000.0,0.0,2008-12-04 20:50:23,1323.0,0.0,Rock,[12],[12],938.0,260.0,4.0,['new orleans']
25695,0.0,0.0,3776.0,4955.0,[],4.0,0.0,4.0,4432.0,Computer Jesus Refrigerator,['computer jesus refrigerator'],training,large,224000.0,0.0,2010-11-23 14:09:51,29.0,1.0,Experimental,"[1, 32]","[32, 1, 38]",957.0,332.0,0.0,[]
106317,0.0,0.0,22875.0,794.0,"['ambient', 'post-concrete', 'sound art']",32.0,0.0,0.0,24314.0,Razabri & Lezet,['razabri lezet'],validation,large,320000.0,0.0,2017-03-21 04:03:44,22.0,0.0,Experimental,"[38, 41, 247]","[41, 38, 247]",38.0,34.0,11.0,"['ambient', 'post-concrete', 'sound art']"
23544,0.0,0.0,7446.0,29689.0,[],14.0,0.0,5.0,6196.0,Lou Barlow,['lou barlow'],training,medium,128000.0,0.0,2010-09-30 16:34:02,169.0,2.0,Rock,[66],"[66, 12]",1376.0,1105.0,2.0,[]
13605,0.0,0.0,4874.0,2236.0,[],6.0,0.0,0.0,5833.0,Syem,['syem'],training,medium,320000.0,0.0,2009-12-11 17:35:49,203.0,2.0,Electronic,[185],"[185, 15]",1140.0,801.0,4.0,[]
85635,0.0,0.0,19199.0,3439.0,[],8.0,0.0,3.0,21055.0,Kacéo,['kaco'],training,large,245078.0,0.0,2015-10-31 09:13:20,260.0,0.0,Folk,[103],"[17, 103]",397.0,298.0,4.0,[]


This is the biggest dataset you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
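
For the sampling/subsetting caveat, one option is a stratified subsample, which keeps each genre's share roughly intact. The sketch below uses a made-up frame (`demo`) in place of the real tracks data, and the 10% fraction is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the cleaned tracks data (an assumption).
demo = pd.DataFrame({
    "listens": range(1000),
    "genre_top": ["Rock"] * 600 + ["Folk"] * 300 + ["Jazz"] * 100,
})

# Keep 10% of the rows while preserving the genre proportions.
subset, _ = train_test_split(
    demo, train_size=0.1, stratify=demo["genre_top"], random_state=42
)

print(len(subset))
print(subset["genre_top"].value_counts())
```

A plain `demo.sample(frac=0.1, random_state=42)` is simpler, but with rare genres it can leave some classes with very few rows.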

If a model that classifies *all* genres doesn't perform well, it's totally OK to limit it to the most frequent genres, or to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
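
Limiting or pooling genres can be sketched in a few lines. The labels below are made up, and `k = 3` is an arbitrary cut-off; on the real data the same calls would be applied to the `genre_top` column.

```python
import pandas as pd

# Made-up labels standing in for the genre_top column (an assumption).
demo = pd.DataFrame({
    "genre_top": ["Rock"] * 5 + ["Electronic"] * 4 + ["Folk"] * 3
                 + ["Blues"] * 1 + ["Easy Listening"] * 1,
})

# Keep the k most frequent genres; k is a tunable choice, not a rule.
k = 3
top_genres = demo["genre_top"].value_counts().nlargest(k).index
top_only = demo[demo["genre_top"].isin(top_genres)]

# Alternative: keep every row but pool the rare genres into one class.
pooled = demo["genre_top"].where(demo["genre_top"].isin(top_genres), "Other")

print(top_only["genre_top"].value_counts())
print(pooled.value_counts())
```

Dropping rows changes what the reported accuracy means (it only covers the kept genres), whereas pooling keeps all tracks at the cost of a catch-all class.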

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

In [0]:
# encode genre_top
df1["genre_top_num"] = LabelEncoder().fit_transform(df1["genre_top"])
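
It can help to keep the fitted encoder around rather than discarding it, so that integer predictions can be mapped back to genre names. A small sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

genres = ["Rock", "Hip-Hop", "Pop", "Rock", "Folk"]  # made-up sample

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

# classes_ is sorted, so each integer code is a position in this array.
print(encoder.classes_.tolist())  # ['Folk', 'Hip-Hop', 'Pop', 'Rock']
print(codes.tolist())             # [3, 1, 2, 3, 0]

# inverse_transform maps integer predictions back to genre names.
print(encoder.inverse_transform([0, 3]).tolist())  # ['Folk', 'Rock']
```

Because the codes only reflect alphabetical order, they carry no ordinal meaning: code 3 is not "more" than code 1.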

In [0]:
# select_dtypes is the public API for keeping only the numeric columns
df2 = df1.select_dtypes(include='number')

In [24]:
df2.shape

(49598, 16)

In [25]:
df2.describe()

Unnamed: 0,comments,favorites,id,listens,tracks,comments.1,favorites.1,id.1,bit_rate,comments.2,duration,favorites.2,interest,listens.1,number,genre_top_num
count,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0,49598.0
mean,0.333179,1.145449,11813.603351,19974.69,21.763035,1.16291,16.721602,11400.403786,260278.65793,0.024336,268.627263,2.381447,2523.905,1586.32838,8.545607,7.98383
std,1.312889,2.463242,6455.900325,57369.8,51.492489,4.18622,58.966302,7046.466624,65663.862632,0.332693,284.327919,11.147578,19802.85,6039.952955,16.98826,3.921259
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,-1.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0
25%,0.0,0.0,6530.0,2602.0,7.0,0.0,1.0,5449.0,192000.0,0.0,146.0,0.0,456.0,212.0,2.0,5.0
50%,0.0,0.0,11887.0,6092.0,11.0,0.0,4.0,11384.0,256000.0,0.0,211.0,1.0,938.0,520.0,5.0,7.0
75%,0.0,1.0,17410.0,16219.0,17.0,1.0,12.0,17450.0,320000.0,0.0,299.0,2.0,2091.0,1321.0,9.0,13.0
max,17.0,40.0,22940.0,1193803.0,652.0,68.0,963.0,24357.0,448000.0,37.0,11030.0,1482.0,3293557.0,543252.0,255.0,15.0


In [0]:
df2.corr()

In [0]:
df2.genre_top_num.value_counts()

In [0]:
X = df2.drop(columns = ['genre_top_num'])
y = df2['genre_top_num']

In [0]:
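Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

# Note: newer scikit-learn releases (1.5+) deprecate the multi_class argument;
# multinomial is what lbfgs uses by default there, so it can be dropped
model = LogisticRegression(random_state=42, solver='lbfgs',
                           multi_class='multinomial', max_iter=25000)

model.fit(Xtrain, ytrain)
model.score(Xtest, ytest)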
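
The very large `max_iter` above hints that the solver needs many iterations on the raw columns, whose scales differ by orders of magnitude (compare `bit_rate` with `comments` in the `describe()` output). One thing worth trying, offered here as a suggestion rather than something this notebook tested, is standardizing the features inside a pipeline. The data below is synthetic and all of its names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): two informative features on very
# different scales, loosely mimicking columns like bit_rate and comments.
rng = np.random.default_rng(42)
n = 600
labels = rng.integers(0, 3, size=n)
big = labels * 50_000 + rng.normal(200_000, 20_000, size=n)
small = labels * 1.0 + rng.normal(0, 0.5, size=n)
features = np.column_stack([big, small])

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Standardize inside a pipeline so the scaler is fit on training data only.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)
scaled_score = scaled_model.score(X_te, y_te)
print(scaled_score)
```

On the real data the same pipeline would be fit on `Xtrain` and scored on `Xtest`; whether it improves accuracy there is something to check, not a given.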

In [0]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size=0.2, random_state=42)

model2 = LogisticRegression(random_state=42, solver = 'saga',
                            multi_class = 'multinomial', max_iter=8000)

model2.fit(Xtrain, ytrain)
model2.score(Xtest, ytest)
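
Accuracy is easier to judge against a majority-class baseline. Rock accounts for 14182 of the 49598 labelled tracks, so a model that always predicts Rock is right about 28.6% of the time on the full data (a 20% test split will differ slightly). The sketch below rebuilds the label vector from the `value_counts()` output shown earlier; the dummy feature matrix is a placeholder, since `DummyClassifier` ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Genre counts copied from the value_counts() output earlier in the notebook.
counts = [14182, 10608, 9372, 3552, 2803, 2332, 2079, 1389,
          1230, 571, 554, 423, 194, 175, 110, 24]
labels = np.repeat(np.arange(len(counts)), counts)  # label 0 stands for Rock
dummy_X = np.zeros((labels.size, 1))                 # features are ignored

baseline = DummyClassifier(strategy="most_frequent").fit(dummy_X, labels)
baseline_accuracy = baseline.score(dummy_X, labels)

print(labels.size)                  # 49598
print(round(baseline_accuracy, 4))  # share of the most common genre
```

Several of the single-feature scores reported further down sit close to this figure, which suggests those models are doing little more than predicting the most common genre.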

In [0]:
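# Correlate each numeric column with the encoded genre and keep the columns
# whose correlation is negative (the `< 0.0` filter below).
# Note: after the descending sort the target's own row (correlation 1.0) comes
# first, so `[:-1]` drops the last row (the most negative correlation), not the target.
corr_matrix = df2.corr().sort_values('genre_top_num', ascending=False)
df_corr = pd.DataFrame(corr_matrix.genre_top_num[:-1])
X_list = list(df_corr[df_corr.genre_top_num < 0.0].T)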

In [0]:
#df_corr

In [29]:
X_list

['comments.2',
 'id.1',
 'interest',
 'favorites.2',
 'id',
 'favorites.1',
 'listens.1',
 'tracks',
 'comments',
 'number',
 'comments.1',
 'listens',
 'favorites',
 'bit_rate']

In [0]:
#X = df2.loc[:, ['comments.2', 'id.1', 'interest']].values

In [35]:
# Fit a model on every contiguous slice X_list[i:j] of the candidate columns,
# logging the test accuracy and wall-clock completion time for each one.
all_list = [str(datetime.datetime.now())]
n_cols = len(X_list)
for i in range(n_cols):
    for j in range(i + 1, n_cols + 1):
        cols = X_list[i:j]  # not named `list`, which would shadow the builtin
        print(cols)
        all_list.append(cols)
        X = df2.loc[:, cols].values
        Xtrain, Xtest, ytrain, ytest = train_test_split(
            X, y, test_size=0.2, random_state=42)
        modelx = LogisticRegression(random_state=42, solver='lbfgs',
                                    multi_class='multinomial', max_iter=1000)
        modelx.fit(Xtrain, ytrain)
        score_line = "score: " + str(modelx.score(Xtest, ytest))
        print(score_line)
        all_list.append(score_line)
        time_line = "completion time: " + str(datetime.datetime.now())
        print(time_line)
        all_list.append(time_line)
#print(all_list)


['comments.2']
score: 0.2872983870967742
completion time: 2019-02-26 15:45:32.094405
['comments.2', 'id.1']
score: 0.28689516129032255
completion time: 2019-02-26 15:45:36.732394
['comments.2', 'id.1', 'interest']




score: 0.30252016129032255
completion time: 2019-02-26 15:46:50.280789
['comments.2', 'id.1', 'interest', 'favorites.2']




score: 0.2746975806451613
completion time: 2019-02-26 15:48:05.310638
['comments.2', 'id.1', 'interest', 'favorites.2', 'id']




score: 0.28891129032258067
completion time: 2019-02-26 15:49:19.505719
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1']




score: 0.2809475806451613
completion time: 2019-02-26 15:50:33.636561
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1']




score: 0.31582661290322583
completion time: 2019-02-26 15:51:48.463168
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks']




score: 0.3143145161290323
completion time: 2019-02-26 15:53:04.074168
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.3143145161290323
completion time: 2019-02-26 15:54:18.007911
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.31421370967741935
completion time: 2019-02-26 15:55:32.754004
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.3143145161290323
completion time: 2019-02-26 15:56:48.506578
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.333366935483871
completion time: 2019-02-26 15:58:06.056433
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.33407258064516127
completion time: 2019-02-26 15:59:23.073090
['comments.2', 'id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3551411290322581
completion time: 2019-02-26 16:00:41.236831
['id.1']
score: 0.28689516129032255
completion time: 2019-02-26 16:00:45.681725
['id.1', 'interest']
score: 0.2720766129032258
completion time: 2019-02-26 16:01:05.815097
['id.1', 'interest', 'favorites.2']




score: 0.2995967741935484
completion time: 2019-02-26 16:02:19.138610
['id.1', 'interest', 'favorites.2', 'id']




score: 0.2918346774193548
completion time: 2019-02-26 16:03:32.170506
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1']




score: 0.2813508064516129
completion time: 2019-02-26 16:04:44.567510
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1']




score: 0.314616935483871
completion time: 2019-02-26 16:05:58.701534
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks']




score: 0.3116935483870968
completion time: 2019-02-26 16:07:15.475105
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.31683467741935484
completion time: 2019-02-26 16:08:30.712397
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.31340725806451614
completion time: 2019-02-26 16:09:45.429863
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.314616935483871
completion time: 2019-02-26 16:11:01.345206
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.33377016129032255
completion time: 2019-02-26 16:12:19.500583
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.3336693548387097
completion time: 2019-02-26 16:13:37.697653
['id.1', 'interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.35342741935483873
completion time: 2019-02-26 16:14:55.191895
['interest']
score: 0.3056451612903226
completion time: 2019-02-26 16:16:06.403969
['interest', 'favorites.2']




score: 0.2900201612903226
completion time: 2019-02-26 16:17:22.661866
['interest', 'favorites.2', 'id']
score: 0.28891129032258067
completion time: 2019-02-26 16:17:51.301819
['interest', 'favorites.2', 'id', 'favorites.1']




score: 0.27933467741935486
completion time: 2019-02-26 16:19:05.418003
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1']




score: 0.31290322580645163
completion time: 2019-02-26 16:20:18.831759
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks']




score: 0.31502016129032256
completion time: 2019-02-26 16:21:33.229451
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.3161290322580645
completion time: 2019-02-26 16:22:50.217942
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.31733870967741934
completion time: 2019-02-26 16:24:05.839306
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.31713709677419355
completion time: 2019-02-26 16:25:21.734093
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.33518145161290325
completion time: 2019-02-26 16:26:38.311946
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.3367943548387097
completion time: 2019-02-26 16:27:54.793128
['interest', 'favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3551411290322581
completion time: 2019-02-26 16:29:12.193623
['favorites.2']
score: 0.2991935483870968
completion time: 2019-02-26 16:29:47.482898
['favorites.2', 'id']




score: 0.29193548387096774
completion time: 2019-02-26 16:31:00.388174
['favorites.2', 'id', 'favorites.1']




score: 0.27872983870967744
completion time: 2019-02-26 16:32:14.081570
['favorites.2', 'id', 'favorites.1', 'listens.1']




score: 0.3100806451612903
completion time: 2019-02-26 16:33:28.816812
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks']




score: 0.3209677419354839
completion time: 2019-02-26 16:34:43.881192
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.31582661290322583
completion time: 2019-02-26 16:35:59.357857
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.31582661290322583
completion time: 2019-02-26 16:37:14.815001
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.3180443548387097
completion time: 2019-02-26 16:38:30.912895
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.3336693548387097
completion time: 2019-02-26 16:39:47.142142
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.3334677419354839
completion time: 2019-02-26 16:41:04.110218
['favorites.2', 'id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3512096774193548
completion time: 2019-02-26 16:42:19.309547
['id']
score: 0.286491935483871
completion time: 2019-02-26 16:42:25.124309
['id', 'favorites.1']




score: 0.2730846774193548
completion time: 2019-02-26 16:43:39.919866
['id', 'favorites.1', 'listens.1']




score: 0.3095766129032258
completion time: 2019-02-26 16:44:53.988719
['id', 'favorites.1', 'listens.1', 'tracks']




score: 0.31441532258064514
completion time: 2019-02-26 16:46:08.574518
['id', 'favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.3167338709677419
completion time: 2019-02-26 16:47:22.945289
['id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.3170362903225806
completion time: 2019-02-26 16:48:38.573436
['id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.3199596774193548
completion time: 2019-02-26 16:49:54.337132
['id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.33377016129032255
completion time: 2019-02-26 16:51:10.235954
['id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.3328629032258065
completion time: 2019-02-26 16:52:26.569608
['id', 'favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3495967741935484
completion time: 2019-02-26 16:53:44.130192
['favorites.1']




score: 0.2971774193548387
completion time: 2019-02-26 16:54:57.835811
['favorites.1', 'listens.1']




score: 0.2973790322580645
completion time: 2019-02-26 16:56:12.539717
['favorites.1', 'listens.1', 'tracks']




score: 0.3439516129032258
completion time: 2019-02-26 16:57:25.694310
['favorites.1', 'listens.1', 'tracks', 'comments']




score: 0.3241935483870968
completion time: 2019-02-26 16:58:39.715779
['favorites.1', 'listens.1', 'tracks', 'comments', 'number']




score: 0.32026209677419354
completion time: 2019-02-26 16:59:52.970101
['favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.3409274193548387
completion time: 2019-02-26 17:01:08.088470
['favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.28286290322580643
completion time: 2019-02-26 17:02:22.681391
['favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.2835685483870968
completion time: 2019-02-26 17:03:37.991837
['favorites.1', 'listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3498991935483871
completion time: 2019-02-26 17:04:55.251365
['listens.1']




score: 0.3058467741935484
completion time: 2019-02-26 17:06:09.572522
['listens.1', 'tracks']




score: 0.3191532258064516
completion time: 2019-02-26 17:07:24.362030
['listens.1', 'tracks', 'comments']




score: 0.3217741935483871
completion time: 2019-02-26 17:08:38.419248
['listens.1', 'tracks', 'comments', 'number']




score: 0.3227822580645161
completion time: 2019-02-26 17:09:52.404080
['listens.1', 'tracks', 'comments', 'number', 'comments.1']




score: 0.2974798387096774
completion time: 2019-02-26 17:11:07.482384
['listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.28286290322580643
completion time: 2019-02-26 17:12:21.476473
['listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.2865927419354839
completion time: 2019-02-26 17:13:36.530659
['listens.1', 'tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3497983870967742
completion time: 2019-02-26 17:14:53.417972
['tracks']




score: 0.33225806451612905
completion time: 2019-02-26 17:16:06.579216
['tracks', 'comments']




score: 0.33306451612903226
completion time: 2019-02-26 17:17:20.665389
['tracks', 'comments', 'number']




score: 0.3318548387096774
completion time: 2019-02-26 17:18:36.324563
['tracks', 'comments', 'number', 'comments.1']




score: 0.33830645161290324
completion time: 2019-02-26 17:19:52.809626
['tracks', 'comments', 'number', 'comments.1', 'listens']




score: 0.2967741935483871
completion time: 2019-02-26 17:21:06.672860
['tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.28296370967741935
completion time: 2019-02-26 17:22:21.110293
['tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.35816532258064515
completion time: 2019-02-26 17:23:37.289588
['comments']
score: 0.2939516129032258
completion time: 2019-02-26 17:23:43.303298
['comments', 'number']
score: 0.32106854838709675
completion time: 2019-02-26 17:24:55.637787
['comments', 'number', 'comments.1']
score: 0.3279233870967742
completion time: 2019-02-26 17:26:05.485864
['comments', 'number', 'comments.1', 'listens']




score: 0.2998991935483871
completion time: 2019-02-26 17:27:19.590125
['comments', 'number', 'comments.1', 'listens', 'favorites']




score: 0.30625
completion time: 2019-02-26 17:28:34.743114
['comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']




score: 0.3443548387096774
completion time: 2019-02-26 17:29:52.441760
['number']
score: 0.3227822580645161
completion time: 2019-02-26 17:30:53.410472
['number', 'comments.1']
score: 0.3275201612903226
completion time: 2019-02-26 17:31:56.385442
['number', 'comments.1', 'listens']




score: 0.27610887096774195
completion time: 2019-02-26 17:33:10.700930
['number', 'comments.1', 'listens', 'favorites']




score: 0.2745967741935484
completion time: 2019-02-26 17:34:26.084018
['number', 'comments.1', 'listens', 'favorites', 'bit_rate']
score: 0.3432459677419355
completion time: 2019-02-26 17:35:38.096895
['comments.1']
score: 0.2957661290322581
completion time: 2019-02-26 17:35:52.911908
['comments.1', 'listens']
score: 0.18588709677419354
completion time: 2019-02-26 17:36:03.111208
['comments.1', 'listens', 'favorites']
score: 0.18588709677419354
completion time: 2019-02-26 17:36:13.016574
['comments.1', 'listens', 'favorites', 'bit_rate']
score: 0.3127016129032258
completion time: 2019-02-26 17:36:27.554626
['listens']
score: 0.3138104838709677
completion time: 2019-02-26 17:37:32.111882
['listens', 'favorites']
score: 0.18588709677419354
completion time: 2019-02-26 17:37:42.694346
['listens', 'favorites', 'bit_rate']
score: 0.3127016129032258
completion time: 2019-02-26 17:38:00.048580
['favorites']
score: 0.2940524193548387
completion time: 2019-02-26 17:38:09.820469
['favorites', 'bi

In [0]:
for item in all_list:
  print(item)
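
The feature lists in the log above look like contiguous runs of columns: each starting column is extended one column at a time. A minimal sketch of that enumeration, followed by picking the best-scoring run, is below. The three column names and the scores are copied from the log (scores rounded to four places); `contiguous_runs` and `logged_scores` are hypothetical names, since the structure of `all_list` is not shown here.

```python
def contiguous_runs(columns):
    """Yield every contiguous run of columns: start at each column, extend rightward."""
    for start in range(len(columns)):
        for stop in range(start + 1, len(columns) + 1):
            yield columns[start:stop]

# A short sample of column names taken from the log above
columns = ['tracks', 'comments', 'number']
runs = list(contiguous_runs(columns))

# Scores for those runs as printed in the log, rounded to four decimal places
logged_scores = {
    ('tracks',): 0.3323,
    ('tracks', 'comments'): 0.3331,
    ('tracks', 'comments', 'number'): 0.3319,
    ('comments',): 0.2940,
    ('comments', 'number'): 0.3211,
    ('number',): 0.3228,
}

# Pick the run with the highest test-set score
best = max(logged_scores, key=logged_scores.get)
print(best, logged_scores[best])
```

Because every run is scored on the same test split, the winning subset is tuned to that split; a separate validation set would give a less optimistic estimate.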

In [0]:
#with open('LogisticRegression.txt', 'w') as f:
#    for item in all_list:
#        f.write("%s\n" % item)

In [43]:
#!ls -lsd /content

4 drwxr-xr-x 1 root root 4096 Feb 26 18:58 /content


In [0]:
#from google.colab import drive
#drive.mount('/content/drive')


In [0]:
X = df2.loc[:, ['tracks', 'comments', 'number', 'comments.1', 'listens', 'favorites', 'bit_rate']].values

In [45]:
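# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

# Multinomial (softmax) logistic regression; the large max_iter gives lbfgs room to converge
model4 = LogisticRegression(random_state=42, solver='lbfgs',
                            multi_class='multinomial', max_iter=25000)

model4.fit(Xtrain, ytrain)
# Mean accuracy on the held-out test set
model4.score(Xtest, ytest)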

0.3625
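
An accuracy of 0.3625 is easier to judge next to a baseline. As a minimal sketch, the majority-class baseline always predicts the most common training label. The labels below are small synthetic stand-ins for `ytrain` and `ytest`, and `majority_baseline_accuracy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority))

# Synthetic labels for illustration only (not the FMA data)
y_train_demo = np.array([0, 0, 0, 1, 1, 2])
y_test_demo = np.array([0, 1, 0, 2])

baseline = majority_baseline_accuracy(y_train_demo, y_test_demo)
print(baseline)
```

With the real data, calling the helper on `ytrain` and `ytest` gives a number to compare against `model4.score(Xtest, ytest)`. scikit-learn's `DummyClassifier(strategy='most_frequent')` provides the same baseline through the usual `fit`/`score` API.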

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig a bit deeper into both the code and the math (it takes a gradient descent approach and introduces the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com) and find a competition where you can try logistic regression
- (Not related to logistic regression) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the raw audio, so you can explore it and see what deeper analysis is possible, though this is more of a long-term project than a stretch goal and won't fit in Colab.
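
For the from-scratch stretch goal above, the core idea fits in a few lines. This is a minimal sketch of binary logistic regression trained by batch gradient descent on the logistic (log) loss. It is not the linked article's code; the toy data, learning rate, iteration count, and function names are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Mean logistic loss; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Batch gradient descent on the mean logistic loss."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / len(y)  # gradient of the mean log loss
        w -= lr * grad
    return w

def predict_proba(X, w):
    """Predicted probability of class 1 for each row of X."""
    return sigmoid(add_intercept(X) @ w)

# Toy one-feature data: class 1 for larger x (illustration only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w = fit_logistic(X_toy, y_toy)
p_toy = predict_proba(X_toy, w)
pred_toy = (p_toy >= 0.5).astype(int)
print(log_loss(y_toy, p_toy))
print(pred_toy)
```

The multinomial model used in the assignment generalizes this by replacing the sigmoid with a softmax over one weight vector per class.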