**Introduction**

Effective business decision-making hinges on the accurate formulation and testing of hypotheses. In this project, we aim to analyze the music preferences of listeners in two cities, Springfield and Shelbyville, using real data from Y.Music. Our goal is to test several hypotheses regarding user behavior and music preferences in these cities to draw actionable insights.

**Objectives**

The primary objectives of this project are:

1. To determine if user activity varies by day and city.
2. To compare the genres listened to by residents of Springfield and Shelbyville on Monday mornings and Friday nights.
3. To identify any significant differences in music preferences between the two cities, specifically if Springfield residents prefer pop music and Shelbyville residents favor rap music.

**Data Description**

The data includes the following columns:

- userID: Unique identifier for each user
- Track: Title of the track
- artist: Name of the artist
- genre: Genre of the music
- City: City where the user is located
- time: Timestamp when the track was played
- Day: Day of the week when the track was played

**Hypotheses**

1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. The same applies to Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. Specifically, Springfield users are inclined towards pop music, while Shelbyville users prefer rap music.

**Methodology**

To achieve the project objectives, we will perform the following steps:

1. Data Cleaning and Preparation: Ensure the dataset is clean, with no missing or inconsistent entries.
2. Descriptive Analysis: Provide an overview of the dataset, including user distribution across cities, popular genres, and overall listening habits.
3. Hypothesis Testing: Use statistical methods to test the hypotheses. This includes:
  - Analyzing user activity patterns across different days and cities.
  - Comparing genre preferences on Monday mornings and Friday nights between the two cities.
  - Evaluating overall genre preferences in Springfield and Shelbyville.
4. Conclusion and Recommendations: Summarize the findings and provide recommendations based on the analysis.

**Expected Outcomes**

Upon completion of this project, we expect to have a clear understanding of:

1. How user activity varies by day and city.
2. Whether there are distinct genre preferences on specific days and times between Springfield and Shelbyville.
3. The overall music preferences in each city and how they differ.

This analysis will help in making data-driven decisions for targeted marketing, content creation, and improving user engagement based on city-specific music preferences.

**Data Loading**

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('/content/music_project_en.csv')

In [4]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [5]:
df.describe()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


In [6]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [7]:
df = df.rename(columns={
    '  userID':'user_id',
    'Track':'track',
    '  City  ':'city',
    'Day':'day'
})

In [8]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
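
The explicit mapping above works well for a handful of columns. As a hedged alternative, the same cleanup could be derived from the names themselves. The helper `to_snake_case` below is hypothetical and not part of the original notebook; it strips surrounding whitespace, inserts an underscore at lower-to-upper case boundaries, and lowercases the result.

```python
import re

def to_snake_case(name):
    """Strip surrounding whitespace and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()

# Raw column names as reported by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(col) for col in raw_columns]
print(clean_columns)
```

On the real table, the helper could be passed directly to pandas, for example `df = df.rename(columns=to_snake_case)`.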

In [9]:
df.isna().sum().sort_values(ascending=False)

artist     7567
track      1343
genre      1198
user_id       0
city          0
time          0
day           0
dtype: int64

In [10]:
column_to_replace = ['track','artist','genre']
for value in column_to_replace :
    df[value] = df[value].fillna('unknown')

In [11]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
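
The loop above fills one column at a time. A minimal sketch of an equivalent one-step version, assuming a small toy table in place of the real data (the rows below are illustrative only):

```python
import pandas as pd

# Toy data standing in for the real table (values are illustrative)
sample = pd.DataFrame({
    'track': ['Pessimist', None],
    'artist': [None, 'Rover'],
    'genre': ['dance', None],
    'city': ['Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fillna on a column subset fills all three columns in one step
sample[columns_to_replace] = sample[columns_to_replace].fillna('unknown')

# Total number of missing values left in the table
print(sample.isna().sum().sum())
```

Both forms leave the remaining columns untouched; the subset version simply avoids the explicit loop.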

In [12]:
df.duplicated().sum()

3826

In [13]:
df = df.drop_duplicates().reset_index(drop=True)

In [14]:
df.duplicated().sum()

0

In [15]:
df['genre'].sort_values(ascending=True).unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

In [16]:
def replace_wrong_genres(series,wrong_genres,correct_genres):
    series = series.replace(wrong_genres,correct_genres)
    return series

In [17]:
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

df['genre'] = replace_wrong_genres(df['genre'], wrong_genres, correct_genre)

In [18]:
df['genre'].sort_values(ascending=True).unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
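
Because the printed array is cut off before the genres starting with "h", it cannot confirm on its own that the replacement worked. A direct membership check is one way to verify it. The sketch below uses a toy Series (illustrative values) rather than the real `genre` column:

```python
import pandas as pd

# Toy genre column containing the three implicit duplicates named above
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'])
wrong_genres = ['hip', 'hop', 'hip-hop']

# Same call that replace_wrong_genres() makes internally
cleaned = genres.replace(wrong_genres, 'hiphop')

# Number of rows that still carry one of the wrong spellings
remaining = cleaned.isin(wrong_genres).sum()
print(remaining)
print(cleaned.value_counts())
```

On the real table the equivalent check would be `df['genre'].isin(wrong_genres).sum()`, which should be zero after the replacement.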

We have detected three issues in our data:

1. Incorrect column name formatting
2. Missing values
3. Explicit and implicit duplicates

The column names have been cleaned to facilitate table processing. All missing values have been replaced with 'unknown'. However, we still need to see if the missing values in the 'genre' column will affect our calculations.

Removing the explicit duplicates and merging the implicit ones ('hip', 'hop', and 'hip-hop' into 'hiphop') will make our results more accurate and easier to interpret.
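
One way to gauge how much the filled-in genres could matter is to measure the share of 'unknown' rows per city. This is a sketch on toy data (illustrative values, not the real table), assuming the placeholder string 'unknown' used above:

```python
import pandas as pd

# Toy data standing in for the cleaned table (values are illustrative)
sample = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'pop'],
})

# Fraction of rows per city whose genre was filled with the 'unknown' placeholder
unknown_share = sample['genre'].eq('unknown').groupby(sample['city']).mean()
print(unknown_share)
```

A small share would suggest the placeholder has little influence on genre rankings; a large one would call for more caution.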

**Hypothesis Testing**

**Hypothesis 1**

In [19]:
df.groupby('city')['track'].count().reset_index()

Unnamed: 0,city,track
0,Shelbyville,18512
1,Springfield,42741


In [20]:
df.groupby('day')['track'].count().reset_index()

Unnamed: 0,day,track
0,Friday,21840
1,Monday,21354
2,Wednesday,18059



The function number_tracks() counts the number of tracks played on a given day in a given city.

In [21]:
def number_tracks(day, city):
    # Count the rows (plays) that match both the given day and the given city
    track_count = df.loc[(df['day'] == day) & (df['city'] == city), 'user_id'].count()
    return track_count


In [37]:
spr_mon = number_tracks('Monday','Springfield')
shel_mon = number_tracks('Monday','Shelbyville')
spr_wed = number_tracks('Wednesday','Springfield')
shel_wed = number_tracks('Wednesday','Shelbyville')
spr_fri = number_tracks('Friday','Springfield')
shel_fri = number_tracks('Friday','Shelbyville')

In [38]:
col=['city', 'monday', 'wednesday', 'friday']
res=[
    ['Springfield', spr_mon, spr_wed, spr_fri],
    ['Shelbyville', shel_mon, shel_wed, shel_fri]
]

res_table=pd.DataFrame(data=res, columns=col)
res_table

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


**Conclusion**

1. In Springfield, the number of tracks played peaks on Monday and Friday, while there is a decline in activity on Wednesday.
2. In Shelbyville, conversely, users listen to music more on Wednesday, with less user activity on Monday and Friday.
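
The notebook draws this conclusion by inspecting the counts. As a rough supplementary check that is not part of the original analysis, a Pearson chi-square test of independence can be computed by hand from the table above. It assumes plays are independent observations, which is only loosely true because the same user contributes many plays.

```python
# Play counts copied from the result table above
observed = {
    'Springfield': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
    'Shelbyville': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
}

days = ['Monday', 'Wednesday', 'Friday']
row_totals = {city: sum(counts.values()) for city, counts in observed.items()}
col_totals = {day: sum(observed[city][day] for city in observed) for day in days}
grand_total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for city in observed:
    for day in days:
        expected = row_totals[city] * col_totals[day] / grand_total
        chi2 += (observed[city][day] - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (columns - 1) = 2;
# the 5% critical value of the chi-square distribution with 2 df is about 5.991
critical_value = 5.991
print(chi2 > critical_value)
```

A statistic above the critical value is consistent with the day-of-week pattern differing between the two cities.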

**Hypothesis 2**

In [29]:
spr_general = df[(df['city'] == 'Springfield')]
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [30]:
shel_general = df[(df['city'] == 'Shelbyville')]
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday



The function genre_weekday() takes four parameters:

1. A data table
2. The name of the day
3. A start timestamp in the format 'hh:mm:ss'
4. An end timestamp in the format 'hh:mm:ss'

The function returns the 15 most popular genres on the given day within the period between the two timestamps.

In [32]:
def genre_weekday(df, day, time1, time2):

    # Sequential filtering
    # genre_df will only retain rows from df where the 'day' is equal to the given day
    genre_df = df[df['day'] == day]

    # Keep only the rows of genre_df where 'time' is earlier than time2
    genre_df = genre_df[genre_df['time'] < time2]

    # Keep only the rows of genre_df where 'time' is later than time1
    genre_df = genre_df[genre_df['time'] > time1]

    # Group the filtered DataFrame by the 'genre' column, take the 'user_id' column, and count the number of rows for each genre using the count() method
    genre_df_count = genre_df.groupby('genre')['user_id'].count()

    # Sort the results in descending order (so the most popular genre is displayed first in the Series object)
    genre_df_sorted = genre_df_count.sort_values(ascending=False)

    # Return a Series object containing the 15 most popular genres on a given day during a specified time period
    return genre_df_sorted[:15]
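
The three filters can also be combined into one boolean mask. The variant below, `genre_weekday_compact`, is a hypothetical name for a sketch of that idea; it relies on the same property as the original function, namely that zero-padded 'hh:mm:ss' strings compare in chronological order. The toy table is illustrative only.

```python
import pandas as pd

def genre_weekday_compact(df, day, time1, time2, top_n=15):
    """Return the top_n genres played on `day` strictly between time1 and time2."""
    # Zero-padded 'hh:mm:ss' strings sort in chronological order,
    # so plain string comparison is enough here
    mask = (df['day'] == day) & (df['time'] > time1) & (df['time'] < time2)
    return df.loc[mask, 'genre'].value_counts().head(top_n)

# Toy play log (values are illustrative)
sample = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'pop', 'dance'],
    'day':   ['Monday', 'Monday', 'Monday', 'Friday', 'Monday'],
    'time':  ['08:00:00', '09:30:00', '10:59:59', '08:00:00', '11:00:00'],
})

top = genre_weekday_compact(sample, 'Monday', '07:00:00', '11:00:00')
print(top)
```

Like the original, the comparison is strict, so a play at exactly the end timestamp is excluded. Genres with equal counts may appear in a different order than with `groupby` followed by `sort_values`.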


In [33]:
genre_weekday(spr_general,'Monday','07:00:00','11:00:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [34]:
genre_weekday(shel_general,'Monday','07:00:00','11:00:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [35]:
genre_weekday(spr_general,'Friday','17:00:00','23:00:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [36]:
genre_weekday(shel_general,'Friday','17:00:00','23:00:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

**Conclusion**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to the same genres. The top five genres are the same in both cities, with only rock and electronic switching places.
2. In Springfield, missing genre values are numerous enough that 'unknown' ranks 10th with 161 plays. This indicates that missing values constitute a significant portion of the data, which could affect the reliability of our conclusions.

For Friday night, the situation is similar. Individual rankings vary, but overall the top 15 genres are nearly the same in both cities: only rnb (Springfield) and rap (Shelbyville) appear in one list and not the other.

Therefore, the second hypothesis is partially confirmed:

1. Users listen to the same music at the beginning and end of the week.
2. There is no significant difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the large number of missing values calls these results into question. In Springfield, 'unknown' itself appears in the top 15, so the ranking might look different if those genres were known.
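
The overlap between the two cities can be checked explicitly with set operations. The sketch below copies the Monday-morning top-15 genre names from the outputs above; the variable names are illustrative.

```python
# Monday-morning top-15 genre lists, copied from the outputs above
spr_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world',
              'rusrap', 'alternative', 'unknown', 'classical', 'metal', 'jazz',
              'folk', 'soundtrack']
shel_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop',
               'alternative', 'rusrap', 'jazz', 'classical', 'world', 'rap',
               'soundtrack', 'rnb', 'metal']

# Genres present in both lists, and genres unique to each city
shared = set(spr_monday) & set(shel_monday)
only_springfield = set(spr_monday) - set(shel_monday)
only_shelbyville = set(shel_monday) - set(spr_monday)

print(len(shared))
print(sorted(only_springfield))
print(sorted(only_shelbyville))
```

Thirteen of the fifteen genres are shared; 'folk' and 'unknown' appear only in the Springfield list, and 'rap' and 'rnb' only in the Shelbyville list.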

**Hypothesis 3**

In [39]:
spr_genres = spr_general.groupby(['genre'])['track'].count().sort_values(ascending=False)

In [40]:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

In [41]:
shel_genres = shel_general.groupby(['genre'])['track'].count().sort_values(ascending=False)

In [42]:
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusion**

Pop music is the most popular genre in Springfield, as we anticipated. However, pop is also the most popular genre in Shelbyville, and rap does not make it into the top 5 genres for either city.
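
Because Springfield has more than twice as many plays as Shelbyville, raw counts are hard to compare directly. A minimal sketch that converts the counts above into shares of each city's total plays (the totals come from the per-city track counts computed for Hypothesis 1):

```python
# Total plays per city and per-genre counts, copied from the outputs above
city_totals = {'Springfield': 42741, 'Shelbyville': 18512}
genre_counts = {
    'Springfield': {'pop': 5892, 'dance': 4435, 'rock': 3965,
                    'electronic': 3786, 'hiphop': 2096},
    'Shelbyville': {'pop': 2431, 'dance': 1932, 'rock': 1879,
                    'electronic': 1736, 'hiphop': 960},
}

# Convert raw counts to each genre's share of that city's plays
genre_shares = {
    city: {genre: count / city_totals[city] for genre, count in counts.items()}
    for city, counts in genre_counts.items()
}

for city, shares in genre_shares.items():
    print(city, {genre: round(share, 3) for genre, share in shares.items()})
```

Comparing shares rather than counts makes it easier to see whether a genre is relatively more popular in one city than in the other.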

**Findings**

We tested the following three hypotheses:

1. User activity varies depending on the day and city.
2. On Monday morning, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday night.
3. Listeners in Springfield and Shelbyville have different preferences. Specifically, Springfield users are inclined towards pop music, while Shelbyville users prefer rap music.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville varies depending on the day of the week, although the weekly pattern differs between the two cities. The first hypothesis can be fully accepted.

2. Music preferences do not vary significantly throughout the week in Springfield and Shelbyville. We can see minor differences in ranking on Monday, but users in both Springfield and Shelbyville mostly listen to pop music.
Therefore, this hypothesis cannot be accepted. It is also important to note that the results obtained could differ if we did not have missing values.

3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.
The third hypothesis is rejected. If differences in preferences exist, they cannot be detected in this data.