In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**As someone who loves all kinds of music, I find it interesting to analyze music data related to "Spotify", one of the best music platforms, if not the best one**

I hope you like it. If you find it useful, please upvote ^_^
![](https://storage.googleapis.com/pr-newsroom-wp/1/2020/03/Header.png)

# 1- Loading data

In [None]:
df = pd.read_csv('../input/top50spotify2019/top50.csv', encoding = 'ISO-8859-1',index_col=0)

In [None]:
df.head(5)

> We need to understand the features of any song.

1. *Danceability*: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

2. *Valence*: Describes the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

3. *Energy*: Represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

4. *Tempo*: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece, and derives directly from the average beat duration.

5. *Loudness*: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.

6. *Speechiness*: This detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.

7. *Liveness*: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.

And so on for the remaining features.

For more detail, see: https://towardsdatascience.com/what-makes-a-song-likeable-dbfdb7abe404#:~:text=Valence%3A%20Describes%20the%20musical%20positiveness,measure%20of%20intensity%20and%20activity.

# 2- Information about data

In [None]:
df.info()

***We can see clearly that there is no missing data in our dataframe***

# 3- Sanity Check

## 3.1- Check for duplicates

In [None]:
df.duplicated().sum()  # number of fully duplicated rows
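
The check above only flags rows that are identical in every column. As a rough sketch of a stricter variant, the same method can be restricted to a subset of columns, so that a track listed twice with different feature values is still caught. The toy frame and its values below are hypothetical, and the `Track.Name` column name is an assumption about the dataset.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on the Top 50 Spotify frame `df`.
toy = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song A"],
    "Artist.Name": ["Artist X", "Artist Y", "Artist X"],
    "Popularity": [90, 85, 88],
})

# Full-row duplicates: every column must match.
full_dupes = toy.duplicated().sum()

# Duplicates judged only on the track and artist names.
key_dupes = toy.duplicated(subset=["Track.Name", "Artist.Name"]).sum()

print(full_dupes, key_dupes)
```

If the second count is larger than the first, the same track appears more than once with differing values in other columns.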

## 3.2- Check for missing data

In [None]:
df.isnull().any().sum()  # number of columns that contain at least one missing value
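
Note that the expression above counts columns, not cells. A minimal sketch of the difference, using a hypothetical toy frame with made-up values, is shown here; a per-column count is often easier to act on when something is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberately missing cells.
toy = pd.DataFrame({
    "Energy": [55, np.nan, 70],
    "Popularity": [90, 85, np.nan],
    "Genre": ["pop", "latin", "pop"],
})

# Missing cells per column.
missing_per_column = toy.isnull().sum()

# Number of columns that have at least one missing cell.
columns_with_missing = toy.isnull().any().sum()

print(missing_per_column)
print(columns_with_missing)
```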

## 3.3- Columns data-type

***We can see clearly that all columns in the data frame are stored using the correct data type***

In [None]:
df.dtypes  # data type of each column (df.info() shows the same information)

# 4- Data Analysis

1. What is the distribution of genres?
2. Is there a relationship between popularity and length, speechiness, or acousticness?

In [None]:
df.Genre.value_counts().sort_values(ascending=True).plot(kind = 'barh',
                                                        title = 'Genre distribution', figsize = (8,6))

In [None]:
df.columns

In [None]:
df[['Beats.Per.Minute', 'Energy',
       'Danceability', 'Loudness..dB..', 'Liveness', 'Valence.', 'Length.',
       'Acousticness..', 'Speechiness.', 'Popularity']].corr()

The correlation table shows no strong correlation between popularity and the other song features, but there is a noticeable correlation between *Energy* and *Loudness*.
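
Reading a full correlation matrix by eye is error-prone, so one option is to pull out only the popularity column and rank the features by the absolute size of their correlation. The sketch below assumes the original column names and uses hypothetical numbers, not values from `top50.csv`.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from top50.csv.
toy = pd.DataFrame({
    "Energy": [40, 55, 60, 75, 80],
    "Loudness..dB..": [-9, -7, -6, -4, -3],
    "Length.": [210, 180, 240, 200, 190],
    "Speechiness.": [5, 12, 4, 9, 30],
    "Acousticness..": [60, 20, 45, 10, 5],
    "Popularity": [88, 91, 85, 90, 87],
})

corr = toy.corr()

# Correlation of every feature with popularity, strongest (by magnitude) first.
popularity_corr = (
    corr["Popularity"]
    .drop("Popularity")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(popularity_corr)

# A single pair can be read directly from the matrix.
print(corr.loc["Energy", "Loudness..dB.."])
```

On the real data, replacing `toy` with `df[...]` restricted to the numeric columns should give the same kind of ranking.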

In [None]:
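# Share of the top 50 tracks contributed by each artist
# (the `layout` argument was dropped: it only applies when subplots are drawn)
df['Artist.Name'].value_counts().plot(kind = 'pie',
                                      title = 'Popular Artists', figsize = (20,15))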

# 5- Data cleaning

**The column names can be changed to a cleaner form**

In [None]:
# Lowercase every column name in one step
df.columns = df.columns.str.lower()

In [None]:
df

In [None]:
# Lowercase the names and replace each '.' with '_' (literal match, not a regex)
df.columns = df.columns.str.lower().str.replace('.', '_', regex=False)

In [None]:
df

In [None]:
df = df.rename(columns={'beats_per_minute': 'tempo'})

In [None]:
df.columns

In [None]:
# Strip the trailing underscores left over from the original trailing dots
mod_col = ['valence_', 'length_', 'acousticness__', 'speechiness_']

df = df.rename(columns={col: col.rstrip('_') for col in mod_col})

In [None]:
df = df.rename(columns= {'loudness__db__':'loudness_db'})
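
The renaming steps in this section could also be folded into a single pass. The following is only a sketch of that idea: `clean_column_name` is a hypothetical helper, not part of pandas, and the list of raw names is a partial sample assumed to match the original file.

```python
import re
import pandas as pd

def clean_column_name(name: str) -> str:
    """Lowercase, turn each run of dots into one underscore, trim edge underscores."""
    name = re.sub(r"\.+", "_", name.lower())
    return name.strip("_")

# A sample of raw column names (assumed to match the original CSV header).
raw_columns = ["Track.Name", "Artist.Name", "Beats.Per.Minute",
               "Loudness..dB..", "Valence.", "Acousticness.."]
toy = pd.DataFrame(columns=raw_columns)

# rename() accepts a function and applies it to every column label.
toy = toy.rename(columns=clean_column_name)
toy = toy.rename(columns={"beats_per_minute": "tempo"})

print(list(toy.columns))
```

Collapsing runs of dots before trimming is what turns `Loudness..dB..` into `loudness_db` without a separate special-case rename.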

In [None]:
df

In [None]:
# pandas infers zip compression from the '.zip' extension, so this writes a zipped CSV
df.to_csv('Spotify.zip',index=False)
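
To confirm that the saved file can be read back, a quick round trip is one option. The sketch below uses a hypothetical two-row frame and a temporary directory rather than `/kaggle/working/`, and relies on pandas inferring the compression from the file extension on both write and read.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
toy = pd.DataFrame({"track_name": ["Song A", "Song B"], "popularity": [90, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Spotify.zip")

    # Compression is inferred from the '.zip' extension.
    toy.to_csv(path, index=False)
    is_zip = zipfile.is_zipfile(path)

    # read_csv infers the compression the same way.
    restored = pd.read_csv(path)

print(is_zip)
print(restored)
```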