# Analyzing Music in the Big City

In this project, I will use actual Y.Music data to compare music preferences and user behavior in the cities of Springfield and Shelbyville, and to test the hypotheses below.

<b>Hypotheses:</b>

   1. User activity differs depending on the day of the week and the city.
   2. On Monday morning, residents of Springfield and Shelbyville listen to different genres. The same is true on Friday night.
   3. Users from Springfield and Shelbyville have different preferences. In Springfield they prefer pop music, while in Shelbyville rap music has more fans.

<b>Stages:</b>

1. [Data Overview](#start)
2. [Pre-process The Data](#pre-process)
    - [Column titles](#column)
    - [Missing values](#missing)
    - [Duplicates](#duplicate)
3. [Hypothesis Test](#test)
    - [Hypothesis 1: comparing user behavior in two cities](#1)
    - [Hypothesis 2: music at the beginning and end of the week](#2)
    - [Hypothesis 3: genre preferences in Springfield and Shelbyville](#3)
4. [General Conclusion](#conclusion)

## Data Overview <a id="start"></a>

In [10]:
# Import pandas library
import pandas as pd

In [11]:
# Read the file from the directory
df = pd.read_csv('Y:\\Online Course\\Practicum\\Jupyter Notebook\\1 Project\\music_project_en.csv')

In [12]:
# Show general information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


In [13]:
# Dataset size
df.shape

(65079, 7)

In [14]:
# First 10 rows of the DataFrame
df.head(10)

|   | userID | Track | artist | genre | City | time | Day |
|---|--------|-------|--------|-------|------|------|-----|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist |  | dance | Shelbyville | 21:20:49 | Wednesday |


Documentation:
- `'  userID'` - user ID
- `'Track'` - track title
- `'artist'` - artist name
- `'genre'` - music genre
- `'  City  '` - city where the user lives
- `'time'` - time of day when the track was played
- `'Day'` - day of the week

**Tentative conclusions**

Findings:

1. The data may contain duplicate rows,
2. Column names are inconsistent: some contain stray spaces and mixed capitalization, so they should be converted to snake case,
3. Several columns have missing values.

## Pre-process The Data <a id="pre-process"></a>

### Column titles <a id="column"></a>

In [15]:
# Column titles
# Work on a copy so the raw DataFrame stays untouched
df_clean = df.copy()
df_clean.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Some column titles contain stray spaces and inconsistent capitalization.

In [16]:
# Fix the column titles
df_clean = df_clean.rename(columns={
    '  userID' : 'user_id',
    'Track' : 'track',
    '  City  ': 'city',
    'Day' : 'day'
})

In [17]:
# Check the new column
df_clean.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

The columns are now named `user_id`, `track`, `city`, and `day`: all lowercase, with the stray spaces removed.
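The renaming above lists each column by hand. As a rough sketch of a more general approach, the same cleanup can be done with a small function passed to `rename()`. The helper name `to_snake_case` is hypothetical and not part of the original notebook, and the toy frame only reuses the raw column names shown earlier.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert an underscore wherever a lowercase letter is followed by an uppercase one
    with_underscores = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# Empty toy frame that only carries the raw column names from the dataset
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column label
renamed = raw.rename(columns=to_snake_case)
print(list(renamed.columns))
```

The explicit dictionary used in the notebook is easier to audit for a seven-column table; a function like this is mainly useful when there are many columns.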

### Missing values <a id="missing"></a>

In [18]:
# Count the missing value
df_clean.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

In [19]:
# Fill the missing value with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for col in columns_to_replace:
    df_clean[col] = df_clean[col].fillna('unknown')

In [20]:
# Check fixed value
df_clean.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

Missing values in the `track`, `artist`, and `genre` columns have been replaced with the string `'unknown'`.
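For reference, the same replacement can probably be written without a loop by selecting the columns as a block. This is a minimal sketch on a made-up toy frame, not the project data.

```python
import pandas as pd

# Toy frame with gaps in the same three columns as the dataset
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3'],
    'track': ['Soul People', None, 'True'],
    'artist': [None, 'Space Echo', None],
    'genre': ['dance', None, 'dance'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fillna() on a column subset fills all selected columns at once
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

missing_after = toy.isna().sum()
print(missing_after)
```

Columns outside `columns_to_replace` are left untouched, so identifiers such as `user_id` are never overwritten by the placeholder.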

### Duplicates <a id="duplicate"></a>

In [21]:
# Explicit (Data Frame): Counting duplicate value
df_clean.duplicated().sum()

3826

In [22]:
# Explicit : Clean up duplicate value
df_clean = df_clean.drop_duplicates().reset_index(drop=True)

In [13]:
# Explicit : Check duplicate value
df_clean.duplicated().sum()

0

In [23]:
# Implicit duplicates (Series): list the unique genre names
# to spot different spellings of the same genre
df_clean['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

In [25]:
# There are implicit duplicates in the strings hip, hop, hip-hop
# Implicit (Series): Function to fix duplicate value
def replace_wrong_genres(wrong_genres, correct_genre):
    '''
    Replace every variant spelling of a genre with a single correct name.

    wrong_genres:
        list of strings you want to replace
    correct_genre:
        string to use as the replacement
    '''
    for wrong_genre in wrong_genres:
        # Replace one variant at a time in the module-level df_clean
        df_clean['genre'] = df_clean['genre'].replace(wrong_genre, correct_genre)



In [16]:
# Implicit (Series): Call the function
duplicates_genres = ['hip', 'hop', 'hip-hop']
genre_replace = 'hiphop'
replace_wrong_genres(duplicates_genres, genre_replace)

In [17]:
# Implicit (Series): Check duplicate value
df_clean['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Explicit duplicates were corrected by dropping the duplicated rows, and implicit duplicates were corrected by replacing the variant genre names with a single string.
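The two kinds of duplicates can be illustrated on a small made-up frame. This is only a sketch: the dictionary `genre_aliases` is an illustrative name, and a dictionary passed to `replace()` is used here in place of the loop-based function above.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are identical, and 'hiphop' is spelled three ways
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'D4'],
    'genre': ['hip', 'hip', 'hop', 'hip-hop', 'pop'],
})

# Explicit duplicates: rows that match in every column
n_explicit = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

# Implicit duplicates: different spellings of one genre, mapped to one name
genre_aliases = {'hip': 'hiphop', 'hop': 'hiphop', 'hip-hop': 'hiphop'}
deduped['genre'] = deduped['genre'].replace(genre_aliases)

print(n_explicit)
print(sorted(deduped['genre'].unique()))
```

On a Series, `replace()` with a dictionary matches whole values only, so a genre such as `'triphop'` would not be affected by the `'hop'` entry.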

**Tentative conclusions**

*Findings*:

- Column titles were converted to lowercase *snake case* and stray spaces were removed,
- All missing values have been replaced with `'unknown'`,
- Explicit and implicit duplicates have been fixed.

## Hypothesis Test <a id="test"></a>

### Hypothesis 1: comparing user behavior in two cities <a id="1"></a>

According to the first hypothesis, users from Springfield and Shelbyville behave differently when they listen to music. This test uses the data for Monday, Wednesday, and Friday.

In [18]:
# Count every music play at each city
df_test = df_clean
print(df_test.groupby('city')['track'].count())

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64


Springfield has more track plays than Shelbyville. But that doesn't mean Springfield residents listen to music more often: the city is bigger and has more users.
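One way to separate "more users" from "more listening" would be to divide the number of plays by the number of distinct users in each city. The sketch below shows the idea on made-up data; the column names follow the cleaned dataset, while the names `per_city` and `plays_per_user` are introduced only for illustration.

```python
import pandas as pd

# Toy play log: Springfield has more users, Shelbyville has one heavy listener
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3', 'D4'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6', 't7'],
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville', 'Springfield'],
})

# Named aggregation: total plays and number of distinct users per city
per_city = toy.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```

In this toy data Springfield has more plays in total, but Shelbyville has more plays per user, which is exactly the distinction the raw counts cannot show.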

In [19]:
# Count the track that play on each day 
print(df_test.groupby('day')['track'].count())

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

In [26]:
# Function to count the tracks played in a given city on a given day
def number_tracks(day, city):
    '''
    Count the tracks played in a given city on a given day.

    day:
        desired day
    city:
        desired city
    '''
    track_list = df_test[(df_test['day'] == day) & (df_test['city'] == city)]['user_id']
    track_list_count = track_list.count()

    # Return the count so it can be displayed or reused
    return track_list_count

Call the function six times, changing the parameters each time, to retrieve the data for both cities on each of the three days.

In [21]:
# Total tracks played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740


In [22]:
# Total tracks played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614


In [23]:
# Total tracks played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056


In [24]:
# Total tracks played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003


In [25]:
# Total tracks played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945


In [26]:
# Total tracks played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895


In [27]:
# Table with result
new_data = [['Springfield', 15740, 11056, 15945],
           ['Shelbyville', 5614, 7003, 5895]]

column_names = ['city', 'monday', 'wednesday', 'friday']

table = pd.DataFrame (data = new_data, columns = column_names)
print(table)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
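The table above was typed in by hand from the six function calls. As a hedged alternative, a city-by-day table of the same shape could be built in one step with `pd.crosstab`. The sketch below runs on a made-up play log, so its numbers are not the project results.

```python
import pandas as pd

# Toy play log with one row per track played
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville',
             'Springfield', 'Shelbyville', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Wednesday',
            'Monday', 'Wednesday', 'Friday'],
})

# crosstab counts rows for every (city, day) pair;
# the column selection puts the days in calendar order
day_order = ['Monday', 'Wednesday', 'Friday']
plays_by_city_day = pd.crosstab(toy['city'], toy['day'])[day_order]
print(plays_by_city_day)
```

Building the table from the data avoids copy errors and keeps it in sync if the cleaning steps change.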


**Tentative conclusions**

*Findings*:

1. In Springfield, the number of tracks played peaks on Monday and Friday, with a drop in activity on Wednesday.
2. In Shelbyville, on the other hand, users listen to more music on Wednesday, and activity is lower on Monday and Friday.
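Because the two cities differ so much in size, the pattern is easier to compare as shares of each city's own total. The sketch below reuses the counts from the results table; the variable names are introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the results table
counts = pd.DataFrame(
    {
        'monday': [15740, 5614],
        'wednesday': [11056, 7003],
        'friday': [15945, 5895],
    },
    index=['Springfield', 'Shelbyville'],
)

# Divide each row by its own total so every city sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

# Day with the largest and the smallest share in each city
busiest_day = shares.idxmax(axis=1)
quietest_day = shares.idxmin(axis=1)

print(shares.round(3))
print(busiest_day)
print(quietest_day)
```

Seen this way, Wednesday is the quietest of the three days in Springfield and the busiest in Shelbyville, which matches the findings listed above.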

### Hypothesis 2: music at the beginning and end of the week <a id="2"></a>

According to the second hypothesis, on Monday morning and Friday night Springfield users listen to different genres than Shelbyville users.

In [28]:
# Data Frame that contain Springfield only
spr_general = df_test[df_test['city']=='Springfield']

In [29]:
# Data Frame that contain Shelbyville only
shel_general = df_test[df_test['city']=='Shelbyville']

In [27]:
# Function that returns the 15 most popular genres in a given time window
def genre_weekday(df1, day, time1, time2):
    '''
    Return the 15 most popular genres on a given day between time1 and time2.

    df1:
        desired dataset
    day:
        desired day
    time1:
        desired start time
    time2:
        desired end time
    '''
    # Keep only the rows for the given day and inside the time window
    genre_df = df1[df1['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]

    # Count the plays per genre and sort from most to least popular
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    return genre_df_sorted[:15]

Compare the results of the function `genre_weekday()` for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday night (from 17:00 to 23:00).
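Note that `genre_weekday()` compares the `time` column as text. This works because the values are zero-padded `HH:MM:SS` strings, so alphabetical order matches chronological order. The sketch below shows the same filter as a single boolean mask with `value_counts()`; the function name `top_genres` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def top_genres(plays, day, start, end, n=15):
    """Hypothetical variant of genre_weekday() built on value_counts()."""
    # String comparison is chronological for zero-padded HH:MM:SS values
    in_window = (plays['day'] == day) & (plays['time'] > start) & (plays['time'] < end)
    # value_counts() counts each genre and sorts in descending order
    return plays.loc[in_window, 'genre'].value_counts().head(n)


toy = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'dance', 'pop'],
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:37:09', '10:59:59', '07:00:01', '11:00:00', '08:00:00'],
})

monday_morning = top_genres(toy, 'Monday', '07:00', '11:00')
print(monday_morning)
```

The play at `11:00:00` falls outside the window, because `'11:00:00'` is not smaller than `'11:00'` as a string. Times without zero padding, such as `'8:37:09'`, would break this ordering and would need to be parsed first.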

In [31]:
# Music played on Monday morning from 07:00 to 11:00 in Springfield
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [32]:
# Music played on Monday morning from 07:00 to 11:00 in Shelbyville
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [33]:
# Music played on Friday night from 17:00 to 23:00 in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [34]:
# Music played on Friday night from 17:00 to 23:00 in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Tentative conclusions**

*Findings*:

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same; only rock and electronic have switched places.

2. In Springfield, the number of missing values was so large that the value `'unknown'` came in 10th. This means that missing values make up a sizable share of the data, which may be grounds for questioning the reliability of our conclusions.

For Friday night, the situation is similar. The ranking of individual genres varies somewhat, but overall the top 15 lists for the two cities are nearly the same: 14 of the 15 genres appear in both.
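The overlap between the top 15 lists can be checked with plain set operations. The genre lists below are copied from the four outputs above; the dictionary layout and the helper name `compare` are only one possible way to arrange them.

```python
# Top 15 genres copied from the genre_weekday() outputs
top15 = {
    ('Monday', 'Springfield'): [
        'pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world', 'rusrap',
        'alternative', 'unknown', 'classical', 'metal', 'jazz', 'folk', 'soundtrack'],
    ('Monday', 'Shelbyville'): [
        'pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop', 'alternative', 'rusrap',
        'jazz', 'classical', 'world', 'rap', 'soundtrack', 'rnb', 'metal'],
    ('Friday', 'Springfield'): [
        'pop', 'rock', 'dance', 'electronic', 'hiphop', 'world', 'ruspop', 'classical',
        'alternative', 'rusrap', 'jazz', 'unknown', 'soundtrack', 'rnb', 'metal'],
    ('Friday', 'Shelbyville'): [
        'pop', 'rock', 'electronic', 'dance', 'hiphop', 'alternative', 'jazz', 'classical',
        'rusrap', 'world', 'unknown', 'ruspop', 'soundtrack', 'metal', 'rap'],
}


def compare(day):
    """Count shared genres and list the ones unique to each city."""
    spr = set(top15[(day, 'Springfield')])
    shel = set(top15[(day, 'Shelbyville')])
    return {
        'shared': len(spr & shel),
        'only_springfield': sorted(spr - shel),
        'only_shelbyville': sorted(shel - spr),
    }


overlap = {day: compare(day) for day in ('Monday', 'Friday')}
for day, result in overlap.items():
    print(day, result)
```

This comparison ignores the ranking inside each list, so it only shows which genres make the top 15, not how far apart their positions are.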

*Insights*:

Thus, the second hypothesis is only partially supported:

1. Users listen to similar music at the beginning and end of the week.
2. There is no significant difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield there are so many of them that they affect the top 15. If these values were not missing, the results might be different.

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id="3"></a>

According to the third hypothesis, Shelbyville users prefer rap music, while Springfield users prefer pop music.

In [35]:
# Group the dataframe at Springfield by column genre and count the total
new = spr_general.groupby('genre')['genre'].count()
spr_genres = new.sort_values(ascending=False)

In [36]:
# First 10 row on spr_genres
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [37]:
# Group the data frame at Shelbyville by column genre and count the total
new1 = shel_general.groupby('genre')['genre'].count()
shel_genres = new1.sort_values(ascending=False)

In [38]:
# First 10 row on shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Tentative conclusions**

*Findings*:

The hypothesis is partially confirmed:

1. Pop music is the most popular genre in Springfield, as expected.
2. However, pop music turned out to be just as popular in Shelbyville, and rap music was not in the top 10 for either city.
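Raw counts are hard to compare between cities of different sizes, so it may help to express each genre as a share of all plays in its city. The sketch below uses the top five counts from the two outputs above and the per-city totals from the Hypothesis 1 section; the variable names are illustrative.

```python
import pandas as pd

# Total plays per city, from the groupby('city') output in Hypothesis 1
city_totals = pd.Series({'Springfield': 42741, 'Shelbyville': 18512})

# Top five genre counts copied from spr_genres and shel_genres
top5 = pd.DataFrame(
    {
        'Springfield': [5892, 4435, 3965, 3786, 2096],
        'Shelbyville': [2431, 1932, 1879, 1736, 960],
    },
    index=['pop', 'dance', 'rock', 'electronic', 'hiphop'],
)

# Divide each city's column by that city's total number of plays
genre_share = top5.div(city_totals, axis='columns')
share_gap = (genre_share['Springfield'] - genre_share['Shelbyville']).abs()

print(genre_share.round(3))
print(share_gap.round(3))
```

For each of these five genres the shares in the two cities differ by less than one percentage point, which is consistent with the finding that preferences are very similar.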

## General Conclusion <a id="conclusion"></a>

The hypotheses tested were:

1. User activity differs depending on the day of the week and the city.
2. On Monday morning, residents of Springfield and Shelbyville listen to different genres. The same is true on Friday night.
3. Users from Springfield and Shelbyville have different preferences. In Springfield they prefer pop music, while in Shelbyville rap music has more fans.
    
After analysing the data, the conclusions are:

1. User activity in Springfield and Shelbyville depends on the day of the week, and the pattern differs between the two cities.

The first hypothesis is fully accepted.

2. Music preferences do not vary much over the week in either Springfield or Shelbyville. The order of genres differs slightly on Monday morning, but in both cities pop is the most-played genre.

So this hypothesis cannot be accepted. It should be borne in mind that the results might have been different were it not for the missing values.

3. It turns out that the music preferences of users in Springfield and Shelbyville are very similar.

The third hypothesis is rejected. If there really is a difference in preferences, it cannot be seen from this data.