## Context

* For this project, you will take part in a Kaggle competition based on tabular data. The goal is to design a machine learning algorithm that, given information on a particular concertgoer's experience, automatically classifies how much that concertgoer enjoyed the concert. This is a classification problem with four classes. The training dataset consists of 170,000 training examples and the testing dataset contains 30,000 test examples.

* Each training row contains a unique ID, 18 attributes, and 1 target containing the class
that needs to be predicted. You will be evaluated on the mean F1-score on the private
test leaderboard.
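
The "mean F1-Score" is not defined further here. A common reading is the macro average, that is, the unweighted mean of the per-class F1 scores. The sketch below assumes that interpretation; the function name `macro_f1` and the labels are made up for illustration.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2, 3)):
    """Unweighted mean of per-class F1 scores (assumed meaning of 'mean F1')."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is neither present nor predicted contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, only to exercise the function.
y_true = [0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 2, 1, 1]
score = macro_f1(y_true, y_pred)
```

If scikit-learn is available, `sklearn.metrics.f1_score(y_true, y_pred, average='macro')` should give the same value when every class appears in the data.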

## Instructions

* To participate in the competition, you must provide a list of predicted outputs for the
instances on the Kaggle website. To solve the problem you are encouraged to use any
classification methods you can think of, presented in the course or otherwise. Looking
into creative ways to create new features from those provided may prove especially useful
in this competition.

* The goal of this competition is to classify a particular concert experience into one of four classes:
1. Worst Concert Ever
2. Did Not Enjoy
3. Enjoyed
4. Best Concert Ever

* To perform this task you will be given information on the band, the venue, as well as the specific concertgoers.
 
* The dataset contains information on the specific concert, the specific band, and the specific concertgoers. It is to be noted that all three of those are unchanged across all of the training data and test data. Any conclusions on the specific characteristics of a band, concert, or concertgoers will also extend to the test set.

* Unfortunately, the data-gathering step was not impeccable. One can expect some of the training attributes not always to reflect the underlying reality. However, the "Concert Experience" column has been verified and is 100% accurate.

In [121]:
# Import packages
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt


# Load data
train_data = pd.read_csv('./data/train_cleaned3.csv')
valid_data = pd.read_csv('./data/test_cleaned3.csv')

In [122]:
# Take the first 14000 rows of train_data as valid_data
'''
valid_data = train_data.iloc[:14000,:]
train_data = train_data.iloc[14000:,:]
'''

In [123]:
train_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment
14000,ConcertExperience_140683,Ronatron,Hip Hop/Rap,Canada,1958.0,703.0,1096.0,False,False,False,1.218456,-1.322337,-0.318820,0.012531,29.0,concert_goer_684,188.0,Monaco,Best Concert Ever
14001,ConcertExperience_77278,Devon Revival,RnB,United States of America (USA),1992.0,386.0,8103.0,False,True,False,1.855238,0.436764,0.327848,0.877945,51.0,concert_goer_411,179.0,Canada,Worst Concert Ever
14002,ConcertExperience_121888,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,609.0,2980.0,True,True,False,0.865245,-0.670457,1.109423,-0.014722,48.0,concert_goer_305,147.0,Austria,Did Not Enjoy
14003,ConcertExperience_99678,Puddle of Joystick,Rock n Roll,Canada,2010.0,498.0,2980.0,True,False,False,0.525536,-0.891421,1.197209,-0.827261,33.0,concert_goer_270,183.0,Canada,Enjoyed
14004,ConcertExperience_131030,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,655.0,2980.0,True,True,True,0.965036,-1.446422,1.571299,0.278402,75.0,concert_goer_1275,169.0,Canada,Did Not Enjoy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,ConcertExperience_14055,Crazy Joystick Cult,RnB,Canada,1985.0,70.0,162754.0,True,False,False,-0.095021,0.175175,0.914245,0.357359,50.0,concert_goer_707,180.0,United States of America (USA),Did Not Enjoy
169996,ConcertExperience_192792,Crazy Joystick Cult,RnB,Canada,1985.0,963.0,54.0,False,False,False,-0.733719,-0.285776,-0.323312,0.641180,71.0,concert_goer_1373,143.0,Bulgaria,Worst Concert Ever
169997,ConcertExperience_152942,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,764.0,54.0,False,False,False,0.744969,-0.965547,1.020598,1.027389,27.0,concert_goer_1286,176.0,Canada,Did Not Enjoy
169998,ConcertExperience_138957,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,694.0,22026.0,False,True,True,0.821976,0.351411,0.175762,1.455654,39.0,concert_goer_1845,176.0,Canada,Did Not Enjoy


In [124]:
valid_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment
0,ConcertExperience_180106,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,900.0,2980.0,False,False,,0.330843,-0.958408,-0.943548,-1.636806,29.0,concert_goer_1985,140.0,Paraguay,Did Not Enjoy
1,ConcertExperience_146268,Beyond Devon,Pop Music,United States of America (USA),1968.0,731.0,54.0,True,False,True,-2.069449,0.017777,-1.910675,0.610265,43.0,concert_goer_1874,158.0,United Kingdom (UK),Enjoyed
2,ConcertExperience_128743,Ron Talent,Rock n Roll,Canada,1955.0,103.0,403.0,False,False,True,-0.484268,1.968772,-0.064167,-1.260871,68.0,concert_goer_442,159.0,United States of America (USA),Did Not Enjoy
3,ConcertExperience_140839,Devon Revival,RnB,United States of America (USA),1992.0,704.0,8103.0,False,True,False,-0.858054,1.022827,-0.348389,-1.147251,17.0,concert_goer_1149,150.0,Canada,Worst Concert Ever
4,ConcertExperience_19149,Beyond Devon,Pop Music,United States of America (USA),1968.0,95.0,54.0,False,False,False,-0.793029,-1.166528,-0.043766,0.969661,59.0,concert_goer_930,166.0,United Kingdom (UK),Did Not Enjoy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,ConcertExperience_31549,Allo Knickers,Heavy Metal,United States of America (USA),1990.0,157.0,22026.0,True,False,True,0.299357,0.378583,0.783903,-1.718934,62.0,concert_goer_234,141.0,Canada,Did Not Enjoy
13996,ConcertExperience_49178,Devon Thunder Thighs,Pop Music,United States of America (USA),1984.0,245.0,8103.0,True,False,False,0.896097,-0.280567,1.157359,0.506625,68.0,concert_goer_1928,152.0,Canada,Worst Concert Ever
13997,ConcertExperience_46592,The Running Crazy Jockies,Pop Music,United States of America (USA),1988.0,232.0,162754.0,True,False,False,0.108632,2.130003,-1.579546,-0.571450,44.0,concert_goer_500,161.0,Canada,Enjoyed
13998,ConcertExperience_44132,Disciples Of Devon,Heavy Metal,United States of America (USA),1988.0,220.0,54.0,False,False,False,-0.604077,-1.422649,2.183270,1.778174,60.0,concert_goer_192,163.0,Canada,Enjoyed


## Replace the "Concert Enjoyment" column with numerical values and remove 'concert_goer_' from every Concert Goer ID

In [125]:
# Replace the "Concert Enjoyment" column with numerical values
train_data['Concert Enjoyment'] = train_data['Concert Enjoyment'].replace(['Worst Concert Ever', 'Did Not Enjoy', 'Enjoyed', 'Best Concert Ever'], [0, 1, 2, 3])
valid_data['Concert Enjoyment'] = valid_data['Concert Enjoyment'].replace(['Worst Concert Ever', 'Did Not Enjoy', 'Enjoyed', 'Best Concert Ever'], [0, 1, 2, 3])
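
Model predictions will come back as integers, so they need to be mapped to class names again before submission (assuming the submission expects the original strings, which is not stated here). A minimal sketch with an explicit dictionary and its inverse, on a made-up frame; `LABEL_TO_INT` and `INT_TO_LABEL` are hypothetical names:

```python
import pandas as pd

# Ordinal encoding taken from the replace() calls above.
LABEL_TO_INT = {
    'Worst Concert Ever': 0,
    'Did Not Enjoy': 1,
    'Enjoyed': 2,
    'Best Concert Ever': 3,
}
INT_TO_LABEL = {v: k for k, v in LABEL_TO_INT.items()}

# Toy frame standing in for train_data (made-up values).
toy = pd.DataFrame({'Concert Enjoyment': ['Enjoyed', 'Worst Concert Ever', 'Best Concert Ever']})
encoded = toy['Concert Enjoyment'].map(LABEL_TO_INT)

# Decode integer predictions back to the original class names.
decoded = encoded.map(INT_TO_LABEL)
```

Unlike `replace`, `map` turns any label missing from the dictionary into `NaN`, which makes unexpected values easy to spot.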

In [126]:
# Remove the 'concert_goer_' prefix from every Concert Goer ID and convert it to an integer
train_data['Concert Goer ID'] = train_data['Concert Goer ID'].str.replace('concert_goer_', '')
train_data['Concert Goer ID'] = train_data['Concert Goer ID'].astype(int)

valid_data['Concert Goer ID'] = valid_data['Concert Goer ID'].str.replace('concert_goer_', '')
valid_data['Concert Goer ID'] = valid_data['Concert Goer ID'].astype(int)

In [127]:
train_data.head()

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment
14000,ConcertExperience_140683,Ronatron,Hip Hop/Rap,Canada,1958.0,703.0,1096.0,False,False,False,1.218456,-1.322337,-0.31882,0.012531,29.0,684,188.0,Monaco,3
14001,ConcertExperience_77278,Devon Revival,RnB,United States of America (USA),1992.0,386.0,8103.0,False,True,False,1.855238,0.436764,0.327848,0.877945,51.0,411,179.0,Canada,0
14002,ConcertExperience_121888,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,609.0,2980.0,True,True,False,0.865245,-0.670457,1.109423,-0.014722,48.0,305,147.0,Austria,1
14003,ConcertExperience_99678,Puddle of Joystick,Rock n Roll,Canada,2010.0,498.0,2980.0,True,False,False,0.525536,-0.891421,1.197209,-0.827261,33.0,270,183.0,Canada,2
14004,ConcertExperience_131030,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,655.0,2980.0,True,True,True,0.965036,-1.446422,1.571299,0.278402,75.0,1275,169.0,Canada,1


In [128]:
valid_data.head()

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment
0,ConcertExperience_180106,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,900.0,2980.0,False,False,,0.330843,-0.958408,-0.943548,-1.636806,29.0,1985,140.0,Paraguay,1
1,ConcertExperience_146268,Beyond Devon,Pop Music,United States of America (USA),1968.0,731.0,54.0,True,False,True,-2.069449,0.017777,-1.910675,0.610265,43.0,1874,158.0,United Kingdom (UK),2
2,ConcertExperience_128743,Ron Talent,Rock n Roll,Canada,1955.0,103.0,403.0,False,False,True,-0.484268,1.968772,-0.064167,-1.260871,68.0,442,159.0,United States of America (USA),1
3,ConcertExperience_140839,Devon Revival,RnB,United States of America (USA),1992.0,704.0,8103.0,False,True,False,-0.858054,1.022827,-0.348389,-1.147251,17.0,1149,150.0,Canada,0
4,ConcertExperience_19149,Beyond Devon,Pop Music,United States of America (USA),1968.0,95.0,54.0,False,False,False,-0.793029,-1.166528,-0.043766,0.969661,59.0,930,166.0,United Kingdom (UK),1


## Recommendation engine

### Band name vocabulary

In [129]:
band_rating = train_data.groupby('Band Name').agg({'Concert Enjoyment': 'mean'})
train_data['Band Rating'] = train_data['Band Name'].map(band_rating['Concert Enjoyment'])
valid_data['Band Rating'] = valid_data['Band Name'].map(band_rating['Concert Enjoyment'])

In [130]:
# Build a vocabulary from the 10 most common words in the "Band Name" column, excluding the stop words "the", "and", "of", "for" and "my"
from sklearn.feature_extraction.text import CountVectorizer

X = train_data[['Band Name']]

vectorizer = CountVectorizer(max_features=10, stop_words=['the', 'and', 'of', 'for', 'my'])
X_voc = vectorizer.fit_transform(X['Band Name'])
X_voc_valid = vectorizer.transform(valid_data['Band Name'])

band_name = pd.DataFrame(X_voc.toarray(), columns=vectorizer.get_feature_names())
band_name_valid = pd.DataFrame(X_voc_valid.toarray(), columns=vectorizer.get_feature_names())



In [131]:
band_name

Unnamed: 0,big,blue,crazy,devon,frogs,jockies,joystick,knickers,ron,running
0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,1,0
3,0,0,0,0,0,0,1,0,0,0
4,1,0,0,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...
155995,0,0,1,0,0,0,1,0,0,0
155996,0,0,1,0,0,0,1,0,0,0
155997,0,0,0,0,1,0,0,0,0,0
155998,0,0,0,0,0,0,1,0,0,0


### Band Rating /!\

In [132]:
# Mean Concert Enjoyment for each Band Name
band_rate = train_data.groupby(['Band Name']).agg({'Concert Enjoyment': 'mean'})
band_rate

Unnamed: 0_level_0,Concert Enjoyment
Band Name,Unnamed: 1_level_1
A Dozen Running Iced Frogs,1.2756
Allo Knickers,1.361442
Beyond Devon,2.168493
Big Autumn,1.771886
Big Division,1.582331
Big Frogs Dream,2.107394
Big Joystick,1.85163
Crazy Crazy,0.727213
Crazy Dream and a Pinch of Joystick,0.873654
Crazy Frogs of Devon,1.352646


In [134]:
train_data['Band Rating'] = train_data['Band Name'].map(band_rate['Concert Enjoyment'])

In [135]:
valid_data['Band Rating'] = valid_data['Band Name'].map(band_rate['Concert Enjoyment'])
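
Because `Band Rating` is computed from the target, each training row's own label contributes to its feature value, which can make scores measured on the training data look optimistic. One common mitigation, not used in this notebook, is out-of-fold mean encoding: the rating for a row is computed only from rows in other folds. The helper `oof_mean_encode` below is hypothetical and the data is made up.

```python
import numpy as np
import pandas as pd


def oof_mean_encode(df, key, target, n_folds=5, seed=0):
    """Out-of-fold mean of `target` per `key` (hypothetical helper, not from the notebook)."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_folds  # random, balanced fold ids
    global_mean = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_folds):
        held_out = fold == k
        # Means are computed on the other folds only.
        means = df.loc[~held_out].groupby(key)[target].mean()
        encoded.loc[held_out] = df.loc[held_out, key].map(means)
    # Keys unseen in the other folds fall back to the global mean.
    return encoded.fillna(global_mean)


# Toy data with made-up band names and ratings.
toy = pd.DataFrame({
    'Band Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Concert Enjoyment': [3, 2, 3, 0, 1, 0, 2, 2],
})
toy['Band Rating'] = oof_mean_encode(toy, 'Band Name', 'Concert Enjoyment', n_folds=4)
```

The validation and test frames can still use the plain per-band mean from the full training set, as the cells above do.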

In [136]:
train_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment,Band Rating
14000,ConcertExperience_140683,Ronatron,Hip Hop/Rap,Canada,1958.0,703.0,1096.0,False,False,False,1.218456,-1.322337,-0.318820,0.012531,29.0,684,188.0,Monaco,3,2.081862
14001,ConcertExperience_77278,Devon Revival,RnB,United States of America (USA),1992.0,386.0,8103.0,False,True,False,1.855238,0.436764,0.327848,0.877945,51.0,411,179.0,Canada,0,0.768669
14002,ConcertExperience_121888,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,609.0,2980.0,True,True,False,0.865245,-0.670457,1.109423,-0.014722,48.0,305,147.0,Austria,1,1.344189
14003,ConcertExperience_99678,Puddle of Joystick,Rock n Roll,Canada,2010.0,498.0,2980.0,True,False,False,0.525536,-0.891421,1.197209,-0.827261,33.0,270,183.0,Canada,2,1.479306
14004,ConcertExperience_131030,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,655.0,2980.0,True,True,True,0.965036,-1.446422,1.571299,0.278402,75.0,1275,169.0,Canada,1,0.918399
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,ConcertExperience_14055,Crazy Joystick Cult,RnB,Canada,1985.0,70.0,162754.0,True,False,False,-0.095021,0.175175,0.914245,0.357359,50.0,707,180.0,United States of America (USA),1,0.985702
169996,ConcertExperience_192792,Crazy Joystick Cult,RnB,Canada,1985.0,963.0,54.0,False,False,False,-0.733719,-0.285776,-0.323312,0.641180,71.0,1373,143.0,Bulgaria,0,0.985702
169997,ConcertExperience_152942,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,764.0,54.0,False,False,False,0.744969,-0.965547,1.020598,1.027389,27.0,1286,176.0,Canada,1,0.913153
169998,ConcertExperience_138957,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,694.0,22026.0,False,True,True,0.821976,0.351411,0.175762,1.455654,39.0,1845,176.0,Canada,1,2.120677


### Concert rating /!\

In [137]:
concert_list = train_data.groupby('Concert ID').agg({'Concert Enjoyment': 'mean'})
concert_list.loc[900]

Concert Enjoyment    1.932927
Name: 900.0, dtype: float64

In [138]:
concert_list

Unnamed: 0_level_0,Concert Enjoyment
Concert ID,Unnamed: 1_level_1
0.0,1.070968
1.0,2.203947
2.0,0.993671
3.0,1.040000
4.0,1.411765
...,...
995.0,1.214286
996.0,1.103226
997.0,1.546099
998.0,1.880795


In [139]:
concert_list.loc[350]

Concert Enjoyment    2.278912
Name: 350.0, dtype: float64

In [140]:
# Add a new column "Concert Rating" to train_data which is the average of "Concert Enjoyment" for each "Concert ID"

train_data['Concert Rating'] = train_data['Concert ID'].map(concert_list['Concert Enjoyment'])
train_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,...,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment,Band Rating,Concert Rating
14000,ConcertExperience_140683,Ronatron,Hip Hop/Rap,Canada,1958.0,703.0,1096.0,False,False,False,...,-1.322337,-0.318820,0.012531,29.0,684,188.0,Monaco,3,2.081862,2.056604
14001,ConcertExperience_77278,Devon Revival,RnB,United States of America (USA),1992.0,386.0,8103.0,False,True,False,...,0.436764,0.327848,0.877945,51.0,411,179.0,Canada,0,0.768669,0.474026
14002,ConcertExperience_121888,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,609.0,2980.0,True,True,False,...,-0.670457,1.109423,-0.014722,48.0,305,147.0,Austria,1,1.344189,1.302817
14003,ConcertExperience_99678,Puddle of Joystick,Rock n Roll,Canada,2010.0,498.0,2980.0,True,False,False,...,-0.891421,1.197209,-0.827261,33.0,270,183.0,Canada,2,1.479306,1.627451
14004,ConcertExperience_131030,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,655.0,2980.0,True,True,True,...,-1.446422,1.571299,0.278402,75.0,1275,169.0,Canada,1,0.918399,0.927711
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,ConcertExperience_14055,Crazy Joystick Cult,RnB,Canada,1985.0,70.0,162754.0,True,False,False,...,0.175175,0.914245,0.357359,50.0,707,180.0,United States of America (USA),1,0.985702,0.843537
169996,ConcertExperience_192792,Crazy Joystick Cult,RnB,Canada,1985.0,963.0,54.0,False,False,False,...,-0.285776,-0.323312,0.641180,71.0,1373,143.0,Bulgaria,0,0.985702,1.151724
169997,ConcertExperience_152942,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,764.0,54.0,False,False,False,...,-0.965547,1.020598,1.027389,27.0,1286,176.0,Canada,1,0.913153,0.666667
169998,ConcertExperience_138957,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,694.0,22026.0,False,True,True,...,0.351411,0.175762,1.455654,39.0,1845,176.0,Canada,1,2.120677,1.288732


In [141]:
valid_data['Concert Rating'] = valid_data['Concert ID'].map(concert_list['Concert Enjoyment'])
valid_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,...,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment,Band Rating,Concert Rating
0,ConcertExperience_180106,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,900.0,2980.0,False,False,,...,-0.958408,-0.943548,-1.636806,29.0,1985,140.0,Paraguay,1,1.571156,1.932927
1,ConcertExperience_146268,Beyond Devon,Pop Music,United States of America (USA),1968.0,731.0,54.0,True,False,True,...,0.017777,-1.910675,0.610265,43.0,1874,158.0,United Kingdom (UK),2,2.168493,2.192771
2,ConcertExperience_128743,Ron Talent,Rock n Roll,Canada,1955.0,103.0,403.0,False,False,True,...,1.968772,-0.064167,-1.260871,68.0,442,159.0,United States of America (USA),1,0.707779,0.617834
3,ConcertExperience_140839,Devon Revival,RnB,United States of America (USA),1992.0,704.0,8103.0,False,True,False,...,1.022827,-0.348389,-1.147251,17.0,1149,150.0,Canada,0,0.768669,0.321839
4,ConcertExperience_19149,Beyond Devon,Pop Music,United States of America (USA),1968.0,95.0,54.0,False,False,False,...,-1.166528,-0.043766,0.969661,59.0,930,166.0,United Kingdom (UK),1,2.168493,2.019608
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,ConcertExperience_31549,Allo Knickers,Heavy Metal,United States of America (USA),1990.0,157.0,22026.0,True,False,True,...,0.378583,0.783903,-1.718934,62.0,234,141.0,Canada,1,1.361442,1.391892
13996,ConcertExperience_49178,Devon Thunder Thighs,Pop Music,United States of America (USA),1984.0,245.0,8103.0,True,False,False,...,-0.280567,1.157359,0.506625,68.0,1928,152.0,Canada,0,1.105906,1.207143
13997,ConcertExperience_46592,The Running Crazy Jockies,Pop Music,United States of America (USA),1988.0,232.0,162754.0,True,False,False,...,2.130003,-1.579546,-0.571450,44.0,500,161.0,Canada,2,2.349730,2.353659
13998,ConcertExperience_44132,Disciples Of Devon,Heavy Metal,United States of America (USA),1988.0,220.0,54.0,False,False,False,...,-1.422649,2.183270,1.778174,60.0,192,163.0,Canada,2,1.374009,1.340278
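
`Series.map` yields `NaN` for any `Concert ID` that does not occur in the training data. Whether that happens in this dataset is not shown; if it does, one option is to fall back to the overall training mean, as in this sketch on made-up data:

```python
import pandas as pd

# Toy frames with made-up concert ids and ratings.
train = pd.DataFrame({'Concert ID': [1.0, 1.0, 2.0], 'Concert Enjoyment': [3, 1, 0]})
valid = pd.DataFrame({'Concert ID': [2.0, 7.0]})  # id 7 never appears in training

rating = train.groupby('Concert ID')['Concert Enjoyment'].mean()
global_mean = train['Concert Enjoyment'].mean()

# Unseen ids map to NaN; fall back to the overall training mean.
valid['Concert Rating'] = valid['Concert ID'].map(rating).fillna(global_mean)
```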


## Beginning of Processing

In [142]:
# Select the feature columns and the target ("Concert Enjoyment") from train_data
concert_info = train_data[['Band Name', 'Band Genre', 'Band Country of Origin', 'Band Debut', 'Concert Attendance', 'Inside Venue', 'Rain', 'Concert ID', 'Concert Goer ID', 'Concert Goer Age', 'Seated', 'Personnality Trait 1', 'Personnality Trait 2', 'Personnality Trait 3', 'Personnality Trait 4', 'Height (cm)', 'Concert Goer Country of Origin', 'Concert Rating', 'Band Rating', 'Concert Enjoyment']]
concert_info

Unnamed: 0,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert Attendance,Inside Venue,Rain,Concert ID,Concert Goer ID,Concert Goer Age,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Height (cm),Concert Goer Country of Origin,Concert Rating,Band Rating,Concert Enjoyment
14000,Ronatron,Hip Hop/Rap,Canada,1958.0,1096.0,False,False,703.0,684,29.0,False,1.218456,-1.322337,-0.318820,0.012531,188.0,Monaco,2.056604,2.081862,3
14001,Devon Revival,RnB,United States of America (USA),1992.0,8103.0,False,True,386.0,411,51.0,False,1.855238,0.436764,0.327848,0.877945,179.0,Canada,0.474026,0.768669,0
14002,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,2980.0,True,True,609.0,305,48.0,False,0.865245,-0.670457,1.109423,-0.014722,147.0,Austria,1.302817,1.344189,1
14003,Puddle of Joystick,Rock n Roll,Canada,2010.0,2980.0,True,False,498.0,270,33.0,False,0.525536,-0.891421,1.197209,-0.827261,183.0,Canada,1.627451,1.479306,2
14004,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,2980.0,True,True,655.0,1275,75.0,True,0.965036,-1.446422,1.571299,0.278402,169.0,Canada,0.927711,0.918399,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,Crazy Joystick Cult,RnB,Canada,1985.0,162754.0,True,False,70.0,707,50.0,False,-0.095021,0.175175,0.914245,0.357359,180.0,United States of America (USA),0.843537,0.985702,1
169996,Crazy Joystick Cult,RnB,Canada,1985.0,54.0,False,False,963.0,1373,71.0,False,-0.733719,-0.285776,-0.323312,0.641180,143.0,Bulgaria,1.151724,0.985702,0
169997,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,54.0,False,False,764.0,1286,27.0,False,0.744969,-0.965547,1.020598,1.027389,176.0,Canada,0.666667,0.913153,1
169998,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,22026.0,False,True,694.0,1845,39.0,True,0.821976,0.351411,0.175762,1.455654,176.0,Canada,1.288732,2.120677,1


In [143]:
concert_info_valid = valid_data[['Band Name', 'Band Genre', 'Band Country of Origin', 'Band Debut', 'Concert Attendance', 'Inside Venue', 'Rain', 'Concert ID', 'Concert Goer ID', 'Concert Goer Age', 'Seated', 'Personnality Trait 1', 'Personnality Trait 2', 'Personnality Trait 3', 'Personnality Trait 4', 'Height (cm)', 'Concert Goer Country of Origin', 'Concert Rating', 'Band Rating']]
concert_info_valid

Unnamed: 0,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert Attendance,Inside Venue,Rain,Concert ID,Concert Goer ID,Concert Goer Age,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Height (cm),Concert Goer Country of Origin,Concert Rating,Band Rating
0,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,2980.0,False,False,900.0,1985,29.0,,0.330843,-0.958408,-0.943548,-1.636806,140.0,Paraguay,1.932927,1.571156
1,Beyond Devon,Pop Music,United States of America (USA),1968.0,54.0,True,False,731.0,1874,43.0,True,-2.069449,0.017777,-1.910675,0.610265,158.0,United Kingdom (UK),2.192771,2.168493
2,Ron Talent,Rock n Roll,Canada,1955.0,403.0,False,False,103.0,442,68.0,True,-0.484268,1.968772,-0.064167,-1.260871,159.0,United States of America (USA),0.617834,0.707779
3,Devon Revival,RnB,United States of America (USA),1992.0,8103.0,False,True,704.0,1149,17.0,False,-0.858054,1.022827,-0.348389,-1.147251,150.0,Canada,0.321839,0.768669
4,Beyond Devon,Pop Music,United States of America (USA),1968.0,54.0,False,False,95.0,930,59.0,False,-0.793029,-1.166528,-0.043766,0.969661,166.0,United Kingdom (UK),2.019608,2.168493
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,Allo Knickers,Heavy Metal,United States of America (USA),1990.0,22026.0,True,False,157.0,234,62.0,True,0.299357,0.378583,0.783903,-1.718934,141.0,Canada,1.391892,1.361442
13996,Devon Thunder Thighs,Pop Music,United States of America (USA),1984.0,8103.0,True,False,245.0,1928,68.0,False,0.896097,-0.280567,1.157359,0.506625,152.0,Canada,1.207143,1.105906
13997,The Running Crazy Jockies,Pop Music,United States of America (USA),1988.0,162754.0,True,False,232.0,500,44.0,False,0.108632,2.130003,-1.579546,-0.571450,161.0,Canada,2.353659,2.349730
13998,Disciples Of Devon,Heavy Metal,United States of America (USA),1988.0,54.0,False,False,220.0,192,60.0,False,-0.604077,-1.422649,2.183270,1.778174,163.0,Canada,1.340278,1.374009


In [144]:
#concert_info.dropna(inplace=True)

In [145]:
# Convert "Seated" into binary
concert_info['Seated'] = concert_info['Seated'].replace(['True', 'False'], [1, 0])
concert_info_valid['Seated'] = concert_info_valid['Seated'].replace(['True', 'False'], [1, 0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  concert_info['Seated'] = concert_info['Seated'].replace(['True', 'False'], [1, 0])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  concert_info_valid['Seated'] = concert_info_valid['Seated'].replace(['True', 'False'], [1, 0])
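
The warning above arises because `concert_info` was created by selecting columns from `train_data`; taking an explicit `.copy()` is the usual way to avoid it. Separately, the table printed after this cell still shows `True`/`False`, which suggests `Seated` holds booleans rather than the strings `'True'`/`'False'`, so the string-based `replace` likely leaves it unchanged. A hedged sketch on a made-up frame, assuming boolean values with some missing entries:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; values are made up.
toy = pd.DataFrame({
    'Seated': [True, False, np.nan, True],
    'Rain': [False, True, False, False],
})

# .copy() makes the selection an independent frame, so the assignment
# below does not trigger SettingWithCopyWarning.
info = toy[['Seated']].copy()

# Map booleans to 1/0; missing values stay NaN, so the column becomes float.
info['Seated'] = info['Seated'].map({True: 1, False: 0})
```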


In [146]:
concert_info

Unnamed: 0,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert Attendance,Inside Venue,Rain,Concert ID,Concert Goer ID,Concert Goer Age,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Height (cm),Concert Goer Country of Origin,Concert Rating,Band Rating,Concert Enjoyment
14000,Ronatron,Hip Hop/Rap,Canada,1958.0,1096.0,False,False,703.0,684,29.0,False,1.218456,-1.322337,-0.318820,0.012531,188.0,Monaco,2.056604,2.081862,3
14001,Devon Revival,RnB,United States of America (USA),1992.0,8103.0,False,True,386.0,411,51.0,False,1.855238,0.436764,0.327848,0.877945,179.0,Canada,0.474026,0.768669,0
14002,Ron's Midnight Jockies,Rock n Roll,United States of America (USA),1980.0,2980.0,True,True,609.0,305,48.0,False,0.865245,-0.670457,1.109423,-0.014722,147.0,Austria,1.302817,1.344189,1
14003,Puddle of Joystick,Rock n Roll,Canada,2010.0,2980.0,True,False,498.0,270,33.0,False,0.525536,-0.891421,1.197209,-0.827261,183.0,Canada,1.627451,1.479306,2
14004,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,2980.0,True,True,655.0,1275,75.0,True,0.965036,-1.446422,1.571299,0.278402,169.0,Canada,0.927711,0.918399,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,Crazy Joystick Cult,RnB,Canada,1985.0,162754.0,True,False,70.0,707,50.0,False,-0.095021,0.175175,0.914245,0.357359,180.0,United States of America (USA),0.843537,0.985702,1
169996,Crazy Joystick Cult,RnB,Canada,1985.0,54.0,False,False,963.0,1373,71.0,False,-0.733719,-0.285776,-0.323312,0.641180,143.0,Bulgaria,1.151724,0.985702,0
169997,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,54.0,False,False,764.0,1286,27.0,False,0.744969,-0.965547,1.020598,1.027389,176.0,Canada,0.666667,0.913153,1
169998,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,22026.0,False,True,694.0,1845,39.0,True,0.821976,0.351411,0.175762,1.455654,176.0,Canada,1.288732,2.120677,1


## One Hot Encoding

In [147]:
from sklearn.preprocessing import OneHotEncoder

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(concert_info[['Band Name', 'Band Genre', 'Band Country of Origin', "Concert Goer Country of Origin"]]))

def OneHot(concert_info):

    # Apply a one-hot encoder to each categorical column
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(concert_info[['Band Name', 'Band Genre', 'Band Country of Origin', "Concert Goer Country of Origin"]]))

    # One-hot encoding drops the index, so restore it; use the encoder's feature names as column headers
    OH_cols_train.index = concert_info.index
    OH_cols_train.columns = OH_encoder.get_feature_names(['Band Name', 'Band Genre', 'Band Country of Origin', "Concert Goer Country of Origin"])
    # Strip the 'Band Name_', 'Band Genre_', 'Band Country of Origin_' and 'Concert Goer Country of Origin_' prefixes from the column names
    OH_cols_train.columns = OH_cols_train.columns.str.replace('Band Name_', '')
    OH_cols_train.columns = OH_cols_train.columns.str.replace('Band Genre_', '')
    OH_cols_train.columns = OH_cols_train.columns.str.replace('Band Country of Origin_', '')
    OH_cols_train.columns = OH_cols_train.columns.str.replace('Concert Goer Country of Origin_', '')

    # Remove categorical columns (will replace with one-hot encoding)
    numeric_train_data = concert_info.drop(['Band Name', 'Band Genre', 'Band Country of Origin', "Concert Goer Country of Origin"], axis=1)

    # Add the one-hot encoded columns to the numerical features, keeping the category names as headers
    concert_info = pd.concat([OH_cols_train,numeric_train_data], axis=1)

    return concert_info

In [148]:
concert_info = OneHot(concert_info)
concert_info_valid = OneHot(concert_info_valid)
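
Note that `OneHot` fits a fresh encoder on whichever frame it receives, so the training and validation frames only end up with the same columns if both contain exactly the same categories. A sketch of one way to guarantee alignment, here using `pd.get_dummies` plus `reindex` instead of `OneHotEncoder` (a swapped-in technique), on made-up data:

```python
import pandas as pd

cat_cols = ['Band Genre']  # the notebook encodes four categorical columns

# Toy frames with made-up rows; 'Jazz' never appears in training.
train = pd.DataFrame({'Band Genre': ['RnB', 'Pop Music', 'RnB'], 'Rain': [0, 1, 0]})
valid = pd.DataFrame({'Band Genre': ['Pop Music', 'Jazz'], 'Rain': [1, 1]})

train_oh = pd.get_dummies(train, columns=cat_cols, dtype=float)
valid_oh = pd.get_dummies(valid, columns=cat_cols, dtype=float)

# Force the validation frame onto the training columns: categories missing
# from validation become all-zero columns, unseen ones ('Jazz') are dropped.
valid_oh = valid_oh.reindex(columns=train_oh.columns, fill_value=0.0)
```

With `OneHotEncoder`, the equivalent is to call `fit` once on the training frame and `transform` on both frames, which is the situation `handle_unknown='ignore'` is meant for.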



In [149]:
concert_info

Unnamed: 0,A Dozen Running Iced Frogs,Allo Knickers,Beyond Devon,Big Autumn,Big Division,Big Frogs Dream,Big Joystick,Crazy Crazy,Crazy Dream and a Pinch of Joystick,Crazy Frogs of Devon,...,Concert Goer Age,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Height (cm),Concert Rating,Band Rating,Concert Enjoyment
14000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,29.0,False,1.218456,-1.322337,-0.318820,0.012531,188.0,2.056604,2.081862,3
14001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,51.0,False,1.855238,0.436764,0.327848,0.877945,179.0,0.474026,0.768669,0
14002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,48.0,False,0.865245,-0.670457,1.109423,-0.014722,147.0,1.302817,1.344189,1
14003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,33.0,False,0.525536,-0.891421,1.197209,-0.827261,183.0,1.627451,1.479306,2
14004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,75.0,True,0.965036,-1.446422,1.571299,0.278402,169.0,0.927711,0.918399,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,50.0,False,-0.095021,0.175175,0.914245,0.357359,180.0,0.843537,0.985702,1
169996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,71.0,False,-0.733719,-0.285776,-0.323312,0.641180,143.0,1.151724,0.985702,0
169997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27.0,False,0.744969,-0.965547,1.020598,1.027389,176.0,0.666667,0.913153,1
169998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,39.0,True,0.821976,0.351411,0.175762,1.455654,176.0,1.288732,2.120677,1


In [150]:
concert_info_valid

Unnamed: 0,A Dozen Running Iced Frogs,Allo Knickers,Beyond Devon,Big Autumn,Big Division,Big Frogs Dream,Big Joystick,Crazy Crazy,Crazy Dream and a Pinch of Joystick,Crazy Frogs of Devon,...,Concert Goer ID,Concert Goer Age,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Height (cm),Concert Rating,Band Rating
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1985,29.0,,0.330843,-0.958408,-0.943548,-1.636806,140.0,1.932927,1.571156
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1874,43.0,True,-2.069449,0.017777,-1.910675,0.610265,158.0,2.192771,2.168493
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,442,68.0,True,-0.484268,1.968772,-0.064167,-1.260871,159.0,0.617834,0.707779
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1149,17.0,False,-0.858054,1.022827,-0.348389,-1.147251,150.0,0.321839,0.768669
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,930,59.0,False,-0.793029,-1.166528,-0.043766,0.969661,166.0,2.019608,2.168493
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13995,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,234,62.0,True,0.299357,0.378583,0.783903,-1.718934,141.0,1.391892,1.361442
13996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1928,68.0,False,0.896097,-0.280567,1.157359,0.506625,152.0,1.207143,1.105906
13997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,500,44.0,False,0.108632,2.130003,-1.579546,-0.571450,161.0,2.353659,2.349730
13998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,192,60.0,False,-0.604077,-1.422649,2.183270,1.778174,163.0,1.340278,1.374009


---

## Concat for band name vocabulary

In [152]:
'''
concert_info.reset_index(drop=True, inplace=True)
concert_info = pd.concat([band_name, concert_info], axis=1)
concert_info.drop(['Band Name'], axis=1, inplace=True)
'''

In [153]:
'''
concert_info_valid.reset_index(drop=True, inplace=True)
concert_info_valid = pd.concat([band_name_valid, concert_info_valid], axis=1)
concert_info_valid.drop(['Band Name'], axis=1, inplace=True)
'''

---

In [154]:
concert_info.dropna(inplace=True)

In [155]:
X = concert_info.drop(['Concert Enjoyment'], axis=1)
y = concert_info['Concert Enjoyment']



## Normalization

In [156]:
def d_normalize(df):
    # Standardize each numeric column to zero mean and unit variance (z-score).
    # The statistics come from the frame passed in, and df is modified in place.
    columns = [
        'Band Debut', 'Concert ID', 'Concert Attendance',
        'Personnality Trait 1', 'Personnality Trait 2',
        'Personnality Trait 3', 'Personnality Trait 4',
        'Concert Goer Age', 'Concert Goer ID', 'Height (cm)',
        'Concert Rating', 'Band Rating',
    ]
    for col in columns:
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    return df

In [157]:
X = d_normalize(X)

In [158]:
X_valid = d_normalize(concert_info_valid)
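
Note that `d_normalize` computes the mean and standard deviation from whichever frame it receives, so the validation frame is scaled with its own statistics rather than those of the training data. A common alternative is to compute the statistics once on the training frame and reuse them everywhere. The following is only a minimal sketch of that idea on toy data; the helper names `fit_standardizer` and `apply_standardizer` are hypothetical and not part of the notebook.

```python
import pandas as pd

def fit_standardizer(train_df, columns):
    """Return per-column mean and std computed on the training frame only."""
    return train_df[columns].mean(), train_df[columns].std()

def apply_standardizer(df, columns, mean, std):
    """Return a copy of df whose columns are rescaled with the given statistics."""
    out = df.copy()
    out[columns] = (out[columns] - mean) / std
    return out

# Toy frames standing in for the training and validation data (made-up values).
cols = ['Concert Goer Age', 'Height (cm)']
train = pd.DataFrame({'Concert Goer Age': [20.0, 30.0, 40.0],
                      'Height (cm)': [150.0, 160.0, 170.0]})
valid = pd.DataFrame({'Concert Goer Age': [30.0, 50.0],
                      'Height (cm)': [160.0, 180.0]})

mean, std = fit_standardizer(train, cols)
train_scaled = apply_standardizer(train, cols, mean, std)
valid_scaled = apply_standardizer(valid, cols, mean, std)
print(valid_scaled)
```

With this approach a validation value is expressed in the same units as the training data, which is what a model fitted on the training data expects.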

In [159]:
X = X.to_numpy()
y = y.to_numpy()

In [160]:
'''
X = concert_info.drop(['Concert Enjoyment'], axis=1)
y = concert_info['Concert Enjoyment']
'''

In [161]:
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled training data to evaluate the models
val_size = 0.2
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=val_size, random_state=seed)
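
Because the target has four classes, `f1_score` needs an explicit `average` argument; its default (`'binary'`) raises a `ValueError` on multiclass labels. The sketch below illustrates the difference between macro and micro averaging on a small made-up label vector. Which averaging the competition's "mean F1-Score" corresponds to is not stated in the notebook, so treating it as macro is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for four classes (0..3), two examples per class.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 0, 2, 2, 3, 1])

# The default average='binary' is rejected for multiclass targets.
try:
    f1_score(y_true, y_pred)
    default_failed = False
except ValueError:
    default_failed = True

per_class = f1_score(y_true, y_pred, average=None)    # one F1 per class
macro_f1 = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average='micro')  # computed from global counts

print(default_failed, per_class, macro_f1, micro_f1)
```

Macro averaging gives every class the same weight regardless of its frequency, while micro averaging equals plain accuracy in the single-label multiclass case.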

### Test with Random Forest

----

In [None]:
# Random forest parameters
n_estimators = [10, 100, 1000]
max_depth = [30, 60, 90, 200, 500, 1000]
n_jobs = [-1]

best_params_rf = {'n_estimators': 0, 'max_depth': 0, 'n_jobs': 0, 'f1_score': 0}

In [None]:
for n in n_estimators:
    for m in max_depth:
        for j in n_jobs:
            rf = RandomForestClassifier(n_estimators=n, max_depth=m, n_jobs=j)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_test)
            # Four-class target: an explicit average is required (macro is assumed here)
            f1 = f1_score(y_test, y_pred, average='macro')
            print('n_estimators: ', n, 'max_depth: ', m, 'n_jobs: ', j, 'f1_score: ', f1)
            if f1 > best_params_rf['f1_score']:
                best_params_rf['n_estimators'] = n
                best_params_rf['max_depth'] = m
                best_params_rf['n_jobs'] = j
                best_params_rf['f1_score'] = f1
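
The nested loops above amount to a manual grid search scored on a single hold-out split. scikit-learn's `GridSearchCV` performs the same kind of exhaustive search but scores each combination with cross-validation instead. The sketch below runs on synthetic data from `make_classification` with a deliberately small grid; the `_demo` variable names, the grid values and the `'f1_macro'` scoring choice are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic four-class data standing in for the concert features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=7,
)

# Small illustrative grid; the notebook's grid is much larger.
param_grid = {'n_estimators': [10, 50], 'max_depth': [3, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=7),
    param_grid,
    scoring='f1_macro',  # macro-averaged F1 for each validation fold
    cv=3,                # 3-fold cross-validation instead of one hold-out split
)
search.fit(X_demo, y_demo)

best_params_demo = search.best_params_
best_score_demo = search.best_score_
print(best_params_demo, best_score_demo)
```

Cross-validation costs more compute than a single split, but the resulting score depends less on which rows happened to land in the hold-out set.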

---

### KNN

In [None]:
# KNN parameters
n_neighbors = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 115, 130, 140, 150]
weights = ['uniform', 'distance']
metric = ['minkowski', 'euclidean', 'manhattan']
n_jobs = [-1]

best_params_knn = {'n_neighbors': 0, 'weights': 0, 'metric': 0, 'n_jobs': 0, 'f1_score': 0}

In [None]:
for n in n_neighbors:
    for w in weights:
        for m in metric:
            for j in n_jobs:
                knn = KNeighborsClassifier(n_neighbors=n, weights=w, metric=m, n_jobs=j)
                knn.fit(X_train, y_train)
                y_pred = knn.predict(X_test)
                f1 = f1_score(y_test, y_pred, average='macro')
                print('n_neighbors: ', n, 'weights: ', w, 'metric: ', m, 'n_jobs: ', j, 'f1_score: ', f1)
                if f1 > best_params_knn['f1_score']:
                    best_params_knn['n_neighbors'] = n
                    best_params_knn['weights'] = w
                    best_params_knn['metric'] = m
                    best_params_knn['n_jobs'] = j
                    best_params_knn['f1_score'] = f1

### Gradient Boosting

In [None]:
# Gradient boosting parameters
# GradientBoostingClassifier has no n_jobs argument, so it is not part of the grid.

n_estimators = [500, 1000, 1500]
learning_rate = [0.01, 0.1, 0.5]

best_params_gb = {'n_estimators': 0, 'learning_rate': 0, 'f1_score': 0}

In [None]:
for n in n_estimators:
    for lr in learning_rate:
        gb = GradientBoostingClassifier(n_estimators=n, learning_rate=lr)
        gb.fit(X_train, y_train)
        y_pred = gb.predict(X_test)
        f1 = f1_score(y_test, y_pred, average='macro')
        print('n_estimators: ', n, 'learning_rate: ', lr, 'f1_score: ', f1)
        if f1 > best_params_gb['f1_score']:
            best_params_gb['n_estimators'] = n
            best_params_gb['learning_rate'] = lr
            best_params_gb['f1_score'] = f1

### Extra Trees

In [None]:
# Extra trees parameters

n_estimators = [500, 1000, 1500]
max_depth = [3, 5, 7, 10]
n_jobs = [-1]

best_params_et = {'n_estimators': 0, 'max_depth': 0, 'n_jobs': 0, 'f1_score': 0}

In [None]:
for n in n_estimators:
    for m in max_depth:
        for j in n_jobs:
            et = ExtraTreesClassifier(n_estimators=n, max_depth=m, n_jobs=j)
            et.fit(X_train, y_train)
            y_pred = et.predict(X_test)
            f1 = f1_score(y_test, y_pred, average='macro')
            print('n_estimators: ', n, 'max_depth: ', m, 'n_jobs: ', j, 'f1_score: ', f1)
            if f1 > best_params_et['f1_score']:
                best_params_et['n_estimators'] = n
                best_params_et['max_depth'] = m
                best_params_et['n_jobs'] = j
                best_params_et['f1_score'] = f1

In [None]:
print('Random forest best parameters: ', best_params_rf)
print('KNN best parameters: ', best_params_knn)
print('Gradient boosting best parameters: ', best_params_gb)
print('Extra trees best parameters: ', best_params_et)
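
Once the four dictionaries are filled, the overall winner can be picked programmatically rather than by reading the printed output. The following is a small sketch; the `best_model` helper is hypothetical and the scores in `results_demo` are placeholder values chosen for illustration, not results obtained in this notebook.

```python
def best_model(results):
    """Return (name, params) for the entry with the highest 'f1_score'."""
    name = max(results, key=lambda key: results[key]['f1_score'])
    return name, results[name]

# Placeholder values only; in the notebook these would be best_params_rf,
# best_params_knn, best_params_gb and best_params_et.
results_demo = {
    'random_forest': {'n_estimators': 100, 'max_depth': 30, 'f1_score': 0.61},
    'knn': {'n_neighbors': 20, 'weights': 'distance', 'f1_score': 0.48},
    'gradient_boosting': {'n_estimators': 500, 'learning_rate': 0.1, 'f1_score': 0.66},
    'extra_trees': {'n_estimators': 500, 'max_depth': 10, 'f1_score': 0.57},
}

winner_name, winner_params = best_model(results_demo)
print(winner_name, winner_params)
```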