## 1. Introduction

Discogs is a comprehensive database of music records, including commercial, promotional, and bootleg or off-label records. The dataset contains 15.7 million records, which makes it possible to track how music release trends have changed over time.

## 2. Objectives

To gauge the market for turntables and boomboxes, one approach is to look at the trend in vinyl and cassette records, since both formats need a dedicated device to play. This raises the following questions about the demand for those players.

1. How many vinyl and cassette records were released per year from 2000 to 2019?
2. Is there a correlation between vinyl records and cassette records?
3. What is the forecast for the vinyl and cassette records released through 2025?

## 3. Method

* Python programming

## 4. Prepare

- First, import the required libraries and locate the dataset files used in the analysis.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import statsmodels.formula.api as smf
from scipy import stats
from scipy.stats import linregress

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

- Read the .csv file into a data frame.

In [None]:
data_style = pd.read_csv('/kaggle/input/discogs-database-all-release-data/release_data_styles/release_data_styles.csv')

- Create data frames for vinyl and cassette records, restricted to the years 2000 through 2019.

In [None]:
data_v=data_style[(data_style['format']=='Vinyl')&((data_style['year']>=2000.0)&(data_style['year']<2020.0))]
data_c=data_style[(data_style['format']=='Cassette')&((data_style['year']>=2000.0)&(data_style['year']<2020.0))]
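The two filters above differ only in the format name, so they can be wrapped in a small helper. The sketch below is one possible way to do that; the function name `select_format` and the toy data frame are made up for illustration and are not part of the Discogs dataset. It assumes the same `format` and `year` columns used above.

```python
import pandas as pd

def select_format(df, fmt, start=2000, end=2019):
    """Return rows of `df` with the given format and start <= year <= end."""
    # Series.between is inclusive on both ends by default
    mask = (df['format'] == fmt) & df['year'].between(start, end)
    return df[mask].copy()

# Toy stand-in for data_style; the values are invented for illustration only
toy = pd.DataFrame({
    'release_id': [1, 2, 3, 4, 5],
    'format': ['Vinyl', 'Cassette', 'Vinyl', 'Vinyl', 'CD'],
    'year': [1999.0, 2005.0, 2010.0, 2020.0, 2012.0],
})

toy_v = select_format(toy, 'Vinyl')
toy_c = select_format(toy, 'Cassette')
print(toy_v)
print(toy_c)
```

Returning a copy avoids pandas warnings if columns are added to the filtered frame later.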

## 5. Process

- Check whether the data contains duplicated releases.

In [None]:
# See if there is duplicated data for vinyl collections
duplicates=data_v.duplicated(subset=['release_id'], keep=False)
print(data_v[duplicates].sort_values(by='release_id').head(10))

- Remove the duplicates from both the vinyl and cassette data frames.

In [None]:
# drop duplicates for vinyl collections
data_v_dedu=data_v.drop_duplicates(subset=['release_id'], keep='first')
print(data_v_dedu.head(10))

In [None]:
# repeat the process for cassette collections
duplicates=data_c.duplicated(subset=['release_id'], keep=False)
data_c_dedu=data_c.drop_duplicates(subset=['release_id'], keep='first')
print(data_c_dedu.sort_values(by='release_id').head(10))
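It can help to record how many rows the deduplication removes, so the effect of this step is visible. The following is a minimal sketch on invented data; the helper name `dedupe_by_release` is hypothetical, and it assumes that `release_id` identifies a release as in the cells above.

```python
import pandas as pd

def dedupe_by_release(df, key='release_id'):
    """Drop repeated rows per release and report how many were removed."""
    # keep='first' marks every repeat after the first occurrence as a duplicate
    n_dup = int(df.duplicated(subset=[key], keep='first').sum())
    return df.drop_duplicates(subset=[key], keep='first'), n_dup

# Toy data, invented for illustration: release 10 appears twice, release 12 three times
toy = pd.DataFrame({
    'release_id': [10, 10, 11, 12, 12, 12],
    'format': ['Vinyl'] * 6,
    'year': [2001.0, 2001.0, 2002.0, 2003.0, 2003.0, 2003.0],
})

toy_dedu, n_removed = dedupe_by_release(toy)
print(f'removed {n_removed} rows, {len(toy_dedu)} unique releases remain')
```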

- Because the goal is to understand how vinyl and cassette releases changed over time, check whether any values are missing. Based on the counts and the visualization, the data looks complete.

In [None]:
# check for missing values in every column of the vinyl collections (it seems good)
print(data_v_dedu.isna().sum())
msno.matrix(data_v_dedu)
plt.show()

In [None]:
# repeat the same process for the cassette collections (It also seems good)
print(data_c_dedu.isna().sum())
msno.matrix(data_c_dedu)
plt.show()
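One caveat worth checking: a comparison such as `year >= 2000.0` is false for `NaN`, so rows without a year are already excluded by the earlier filter and cannot show up in the checks above. The sketch below, on invented data, shows one way to measure the share of missing years per format before any filtering. Whether the real file contains missing years is not verified here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the unfiltered data; the values are invented for illustration
toy = pd.DataFrame({
    'format': ['Vinyl', 'Vinyl', 'Cassette', 'Vinyl'],
    'year': [2001.0, np.nan, 2005.0, np.nan],
})

# Share of rows with a missing year, computed separately for each format
missing_share = toy['year'].isna().groupby(toy['format']).mean()
print(missing_share)

# NaN never satisfies the comparison, so these rows vanish after filtering
kept = toy[toy['year'] >= 2000.0]
print(len(toy) - len(kept), 'rows dropped by the year filter')
```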

- Keep only the required columns, count the releases per year, and combine the vinyl and cassette counts into a single data frame.

In [None]:
data_v_sorted=data_v_dedu[['year', 'format']].groupby(by='year').count()
data_v_sorted.columns=['vinyl']
data_v_sorted=data_v_sorted.reset_index()
data_c_sorted=data_c_dedu[['year', 'format']].groupby(by='year').count()
data_c_sorted.columns=['cassette']
data_c_sorted=data_c_sorted.reset_index()

In [None]:
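# merge on the shared 'year' column; the inner join keeps only years present in both frames
data_v_c=pd.merge(data_v_sorted, data_c_sorted, on='year', how='inner')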
data_v_c['year']=data_v_c['year'].astype('int')
print(data_v_c)
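An inner merge silently drops any year that is missing from one of the two frames. If that is a concern, the counts can be aligned on the union of years instead, with absent years filled with zero. The sketch below uses invented data and is an alternative under that assumption, not the method used above.

```python
import pandas as pd

# Toy deduplicated frames; the values are invented for illustration
toy_v = pd.DataFrame({'release_id': [1, 2, 3], 'year': [2000.0, 2000.0, 2001.0]})
toy_c = pd.DataFrame({'release_id': [4, 5], 'year': [2000.0, 2002.0]})

# Releases per year for each format
vinyl = toy_v.groupby('year').size().rename('vinyl')
cassette = toy_c.groupby('year').size().rename('cassette')

counts = (
    pd.concat([vinyl, cassette], axis=1)  # outer alignment on the year index
    .fillna(0)                            # a year absent from one format counts as 0
    .astype(int)
    .sort_index()
    .rename_axis('year')
    .reset_index()
)
counts['year'] = counts['year'].astype(int)
print(counts)
```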

## 6. Analysis & Visualization

- Plot the data to see how the number of vinyl and cassette releases changed from 2000 to 2019.

In [None]:
sns.set_style('dark')
sns.set_context("paper")  # set before plotting; the settings only affect elements drawn afterwards
# order=2 fits a second-degree polynomial; ci=95 draws a 95% confidence band
g1=sns.regplot(data=data_v_c, x='year', y='vinyl', order=2, ci=95, label='Vinyl', color='blue')
g2=sns.regplot(data=data_v_c, x='year', y='cassette', order=2, ci=95, label='Cassette', color='orange')
g1.set_title('Trend of vinyl and cassette', y=1)
g1.set(xlabel='Year', ylabel='Released Qty', xticks=list(range(2000,2020)))
plt.xticks(rotation=44)
plt.legend()
plt.show()

- Check whether the numbers of vinyl and cassette releases are correlated. The two series appear to be highly correlated.

In [None]:
corr_v_c=data_v_c[['vinyl', 'cassette']].corr()
print(corr_v_c)
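Two series that both follow the same trend over time can show a high correlation even when their year-to-year movements are unrelated. A simple cross-check is to correlate the yearly changes as well as the levels. The sketch below uses invented numbers, not Discogs counts, and only illustrates the idea.

```python
import pandas as pd

# Toy yearly counts; the values are invented for illustration
toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003, 2004, 2005],
    'vinyl': [100, 90, 85, 95, 110, 130],
    'cassette': [20, 18, 15, 17, 22, 25],
})

# Pearson correlation of the raw yearly counts (levels)
r_levels = toy['vinyl'].corr(toy['cassette'])

# Pearson correlation of the year-over-year changes, which removes the shared trend;
# the first difference is NaN and is ignored by corr
r_changes = toy['vinyl'].diff().corr(toy['cassette'].diff())

print(f'levels: {r_levels:.3f}, changes: {r_changes:.3f}')
```

If the correlation of the changes is much lower than the correlation of the levels, the relationship is mostly a shared trend.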

- To forecast with polynomial regression, add a column for the square of the year. Then plot the regression curves to estimate how many vinyl and cassette records may be released through 2025.

In [None]:
data_v_c['year2']=data_v_c['year']**2
outcome_v=smf.ols('vinyl ~ year + year2', data=data_v_c).fit()
outcome_c=smf.ols('cassette ~ year + year2', data=data_v_c).fit()
print(outcome_v.params)
print(outcome_c.params)
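With raw years around 2000, the `year` and `year2` columns are almost perfectly collinear, which can make the fit numerically fragile. Centring the year before squaring it is a common remedy. The sketch below compares the condition number of the design matrix with and without centring; the helper name `design` is made up for this example.

```python
import numpy as np

years = np.arange(2000, 2020, dtype=float)

def design(x):
    """Design matrix with columns [1, x, x**2], as in a quadratic regression."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

# A large condition number means the least-squares problem is sensitive to rounding
cond_raw = np.linalg.cond(design(years))
cond_centred = np.linalg.cond(design(years - years.mean()))

print(f'raw years:     {cond_raw:.3g}')
print(f'centred years: {cond_centred:.3g}')
```

The fitted curve is the same in both cases; only the coefficients and their numerical stability differ.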

In [None]:
df1=pd.DataFrame()
df1['year']=np.linspace(2000, 2025)
df1['year2']=df1['year']**2
predict_v=outcome_v.predict(df1)
plt.plot(df1['year'], predict_v, label='Vinyl Prediction', linestyle='--', color='green')
plt.plot(data_v_c['year'], data_v_c['vinyl'], 'o', alpha=1, label='Vinyl', color='blue')
df2=pd.DataFrame()
df2['year']=np.linspace(2000, 2025)
df2['year2']=df2['year']**2
predict_c=outcome_c.predict(df2)
plt.plot(df2['year'], predict_c, label='Cassette Prediction', linestyle='--', color='red')
plt.plot(data_v_c['year'], data_v_c['cassette'], 'o', alpha=1, label='Cassette', color='orange')
plt.xlabel('Year')
plt.ylabel('Released Qty')
plt.legend()
plt.show()
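The plot shows the fitted curves, but a table of predictions for whole years answers the forecast question more directly. The sketch below fits a quadratic with `numpy.polynomial.Polynomial.fit`, which rescales the years internally, and tabulates 2020 through 2025. The counts are synthetic and exactly quadratic, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import Polynomial

years = np.arange(2000, 2020)
# Synthetic U-shaped counts, invented for illustration (not Discogs data)
counts = 50.0 * (years - 2010) ** 2 + 1000.0

# Least-squares quadratic fit; the returned object accepts raw year values
poly = Polynomial.fit(years, counts, deg=2)

forecast = pd.DataFrame({'year': np.arange(2020, 2026)})
forecast['predicted'] = poly(forecast['year'].to_numpy())
print(forecast)
```

A quadratic extrapolates without bound, so predictions several years past the data should be read as a rough indication only.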

## 7. Conclusion

Based on the visualization and the results of the analysis, the main observations are as follows.

1. The number of vinyl and cassette releases went down from 2000 to 2010 and has gone up since 2010, through the end of the data in 2019.
2. The numbers of vinyl and cassette releases are highly correlated. When considering products for vinyl records, such as turntables, it therefore makes sense to also evaluate whether products for cassette records, such as boomboxes, are needed.
3. The polynomial regression curves suggest that the number of vinyl and cassette releases will keep increasing through 2025. This trend is worth taking into account when estimating market demand for players.

Suggested further analysis:

1. What are the music genres of these records?
2. Compare the sales of turntables and cassette players with the numbers of vinyl and cassette releases to see whether they are correlated.
3. Compare the market trends of digital music sales and physical record sales.