# Sprint 1

## Demographic Segmentation of the Podcast Listener Base

### Motivation
Podcasting is a growing media space. Publishers and advertising agencies have had difficulty identifying the different segments of listeners within it. This may be because the industry is new and its data is scattered, which makes it hard to assemble a picture of the audience.

### Result
Here we present a research study that applies advanced analytical methods to podcast listener data in order to identify listener segments. Take Netflix as an example: it segments its user base so that it can recommend existing content to users and identify what kind of new content to invest in and produce. We attempt to provide something similar, with meaningful insights that help publishers and advertising agencies reach their target audience.

### Research question
What are the characteristics, trends and demographic segments of podcast listeners in the Canadian podcast space?

### Sprint 1
In this sprint we focus on exploring our data, cleaning it, and making it more readable and ready for analysis.

### What is needed
We use Python as the programming language, with the following libraries in Sprint 1:
1. pandas
2. NumPy

### Summary of Findings
Upon first exploring the data we found that the data headers need to be mapped to a data dictionary. This was challenging because our raw data set has over 500 columns, so we needed a way to join the two files programmatically. We used a mapping function to rename all the columns of the raw data set so that the column names would be more readable.
Second, because the data set has over 500 columns, we had to build an understanding of it and drop columns. Here is what we did and found:
1. We dropped all columns that contain no data whatsoever. This was done by replacing whitespace with NaN values and then dropping every column made up entirely of NaNs. That left 400 columns, which is still a lot and leaves plenty of room for unnecessary data.
2. We looked at the percentage of NaN values in the remaining 400 columns and dropped every column in which more than 50% of the values are NaN. This left us with 184 columns, which is more manageable.
3. After working with the data set for some time, we can tell that it is large and in a messy, complicated format. The data cleaning will take longer than expected.
4. We need to refine the column mapping from the dictionary to our data set, because the dictionary labels are split over an unknown number of rows, which poses a challenge. We can either map the 184 data columns manually or write a more sophisticated function to map them properly.


In [58]:
import pandas as pd
import numpy as np

# read data into pandas
data = pd.read_csv("C:\\users\\omarh\\downloads\\CSDA 1050 Capstone\\Project\\2018 Data from Jeff\\podcast18.csv")

# read the fixed-width data dictionary (schema) file into pandas
data_schema = pd.read_fwf("C:\\users\\omarh\\downloads\\CSDA 1050 Capstone\\Project\\2018 Data from Jeff\\Dictionary AI058 CdnPodcast 2018 FINALJun16 CLIENT.txt", header = 2)

# dict of mappings derived from the data schema file
mapping = {}
# get the unique (variable, label) pairs
data_schema_unique = data_schema.groupby(['Variable', 'Label']).size().reset_index(name='Freq')
# fill the mapping dict with variable as key and label as value
for index, row in data_schema_unique.iterrows():
    if row['Label']:
        mapping[row['Variable']] = row['Label']

# use the mapping dict to rename the variables of the data file with the labels from the schema
data.rename(columns=mapping,inplace=True)

data.head(3);
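
The mapping dict in the cell above can also be built without an explicit loop. The following is a minimal sketch on a toy schema, not the code used in this sprint: `toy_schema`, `toy_mapping` and `toy_data` are hypothetical names with made-up values, and when a variable has more than one label this version keeps the first one, whereas the loop above keeps the last.

```python
import pandas as pd

# Toy stand-in for the parsed dictionary file (values are illustrative only).
toy_schema = pd.DataFrame({
    "Variable": ["study", "id", None, "twave"],
    "Label": ["STUDY IDENTIFICATION", "RESPONDENT ID", "(99999999)", "WAVE (DEFAULT 1)"],
})

# Keep rows that have both a variable name and a label, one row per variable.
pairs = toy_schema.dropna(subset=["Variable", "Label"]).drop_duplicates("Variable")

# Series.to_dict() turns the Variable -> Label lookup into a plain dict.
toy_mapping = pairs.set_index("Variable")["Label"].to_dict()

# Columns without an entry in the mapping (here "extra") keep their name.
toy_data = pd.DataFrame({"study": ["AI058"], "id": [1], "twave": [201806], "extra": [0]})
renamed = toy_data.rename(columns=toy_mapping)
print(list(renamed.columns))
```

Columns that are missing from the dictionary are left untouched by `rename`, which makes them easy to spot afterwards.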

In [49]:
data_schema.head(10)

Unnamed: 0,Variable,Unnamed: 1,Position,Label,Measurement,Level,Role,Column,Width,Alignment,Print,Format,Write,Format.1
0,study,,1.0,STUDY IDENTIFICATION,Nominal,,Input,26.0,,Left,A75,,A75,
1,id,,2.0,RESPONDENT ID,Scale,,Input,10.0,,Right,F8,,F8,
2,,,,(99999999),,,,,,,,,,
3,twave,,3.0,WAVE (DEFAULT 1),Nominal,,Input,10.0,,Right,F8,,F8,
4,arfrr,,4.0,arfrr .REGION ROLLUP,Nominal,,Input,10.0,,Right,F1,,F1,
5,arfpr,,5.0,arfpr .PROVINCE,Nominal,,Input,10.0,,Right,F2,,F2,
6,,,,ROLLUP,,,,,,,,,,
7,arfgen,,6.0,arfgen .GENDER,Nominal,,Input,10.0,,Right,F1,,F1,
8,arfagerf,,7.0,arfagerf .AGE ROLLUP,Ordinal,,Input,10.0,,Right,F1,,F1,
9,,,,FINE,,,,,,,,,,
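
The `head(10)` output above shows label fragments such as `(99999999)`, `ROLLUP` and `FINE` sitting on rows that have no variable name. One possible way to stitch them back onto their variable is sketched below. It assumes, without having been checked against the full dictionary file, that a continuation row always has an empty `Variable` cell and directly follows the variable it belongs to. `merge_split_labels` and `toy_schema` are hypothetical names.

```python
import pandas as pd

def merge_split_labels(schema):
    """Join label fragments from continuation rows onto the preceding variable."""
    rows = schema[["Variable", "Label"]].copy()
    # Continuation rows have no variable name; carry the last seen name down.
    rows["Variable"] = rows["Variable"].ffill()
    rows = rows.dropna(subset=["Variable", "Label"])
    rows["Label"] = rows["Label"].astype(str)
    # Concatenate fragments in file order, separated by a single space.
    return rows.groupby("Variable", sort=False)["Label"].agg(" ".join).to_dict()

# Toy rows shaped like the head(10) output above.
toy_schema = pd.DataFrame({
    "Variable": ["id", None, "arfpr", None, "arfgen"],
    "Label": ["RESPONDENT ID", "(99999999)", "arfpr .PROVINCE", "ROLLUP", "arfgen .GENDER"],
})
full_labels = merge_split_labels(toy_schema)
print(full_labels)
```

If the real file contains other sections in which the `Variable` column is blank for a different reason, those rows would need to be filtered out before the forward fill.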


In [185]:
#data.groupby(['arfpr .PROVINCE','arfgen .GENDER','qs2g .FREQUENCY-']).size().reset_index(name='counts');

In [146]:
#data transformation test
#pd.get_dummies(data, columns=['qs2g .FREQUENCY-']);

In [44]:
data.describe()

Unnamed: 0,RESPONDENT ID,WAVE (DEFAULT 1),qp10aa .PROPORTION-,qp10ab .PROPORTION-,qp10ac .PROPORTION-,qp10ad .PROPORTION-,qp10ae .PROPORTION-,qp10af .PROPORTION-,qe1ba .%BY MYSELF,qe1bb .%WITH OTHERS,...,qe9ba .PROPORTION-,qe9bb .PROPORTION-,qe9bc .PROPORTION-,qe9bd .PROPORTION-,qm4a .INTEREST1 IN,QA1_Advertising_Podcast_1_Code,<none>,<none>.1,<none>.2,FINAL WEIGHT VARIABLE
count,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,...,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0,1534.0
mean,5764778.0,201806.0,40.881356,48.54824,5.188396,1.526728,3.85528,100.0,88.761408,11.238592,...,30.422425,20.563233,49.014342,100.0,10001.0,561.456975,1.0,332.187744,450.526076,1.0
std,61244.86,0.0,36.663634,38.015034,13.830739,6.710703,12.003584,0.0,24.403312,24.403312,...,35.732465,27.665689,40.861864,0.0,0.0,477.692169,0.0,446.95809,478.86102,0.7194
min,5691168.0,201806.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,...,0.0,0.0,0.0,100.0,10001.0,1.0,1.0,3.0,1.0,0.220207
25%,5701247.0,201806.0,1.0,10.0,0.0,0.0,0.0,100.0,91.25,0.0,...,0.0,0.0,2.0,100.0,10001.0,34.0,1.0,24.0,34.0,0.576557
50%,5755331.0,201806.0,33.0,50.0,0.0,0.0,0.0,100.0,100.0,0.0,...,15.0,10.0,47.0,100.0,10001.0,997.0,1.0,47.0,53.0,0.807918
75%,5834686.0,201806.0,72.0,90.0,4.0,0.0,0.0,100.0,100.0,8.75,...,53.0,30.0,100.0,100.0,10001.0,998.0,1.0,999.0,999.0,1.260848
max,5861023.0,201806.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,10001.0,999.0,1.0,999.0,999.0,7.626968


In [237]:
data.shape

(1534, 584)

In [40]:
# replace cells that hold a single space with NaN values
data.replace(' ', np.nan, inplace=True)
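
Note that `data.replace(' ', np.nan)` only matches cells that are exactly one space. If the raw file also holds empty strings or longer runs of blanks, a regular expression may catch more of them. This is a sketch on a hypothetical `toy` frame. Whether the real data contains such cells has not been checked here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": ["1", " ", "3"],
    "b": ["", "   ", "\t"],
    "c": ["x y", " z", "w"],
})

# ^\s*$ matches cells that are empty or contain only whitespace characters.
# Cells with real content around a space ("x y", " z") are left alone.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
nan_counts = cleaned.isna().sum().to_dict()
print(nan_counts)
```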

In [41]:
data.dropna(axis = 1, how= 'all', inplace = True)

In [42]:
data.shape

(1534, 400)

In [46]:
#viewing columns by % of NaN values
print (data.isnull().mean() * 100)

STUDY IDENTIFICATION               0.000000
RESPONDENT ID                      0.000000
WAVE (DEFAULT 1)                   0.000000
arfrr .REGION ROLLUP               0.000000
arfpr .PROVINCE                    0.000000
arfgen .GENDER                     0.000000
arfagerf .AGE ROLLUP               0.000000
arfracea .RACE1                   58.279009
arfraceb .RACE2                   86.505867
arfracec .RACE3                   67.014342
arfraced .RACE4                   95.371578
arfracee .RACE5                   58.083442
arfracef .RACE6                   98.239896
arfraceg .RACE7                   98.891786
arfraceh .RACE8                   98.109518
arfracei .RACE9                   98.370274
arfracej .RACE10                  96.675359
arfracek .RACE11                  93.611473
arfracel .RACE12                  96.936115
arfracem .RACE13                  99.674055
arfracen .RACE14                  96.870926
arfraceo .RACE15                  97.522816
arfhhs .HOUSEHOLD               

In [19]:
data.isna().sum();
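
With 400 columns, the raw count and percentage listings are hard to scan. A small helper that puts both side by side and sorts the most incomplete columns first might help. `missing_report` and the `toy` frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Per-column NaN count and percentage, most incomplete columns first."""
    report = pd.DataFrame({
        "nan_count": df.isna().sum(),
        "nan_pct": (df.isna().mean() * 100).round(2),
    })
    return report.sort_values("nan_pct", ascending=False)

toy = pd.DataFrame({
    "gender": [1, 2, 1, 2],
    "race1": [1, np.nan, np.nan, 2],
    "race13": [np.nan, np.nan, np.nan, 1],
})
report = missing_report(toy)
print(report)
```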

In [45]:
#print(data.isnull().mean() <= 0.5);

In [20]:
# keep columns with at most 50% NaN values (drop those with more than 50%)
data = data.loc[:, data.isnull().mean() <= .5]
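
The boundary case matters for the filter above: a column that is exactly 50% NaN is kept, because the comparison is `<=`. The sketch below wraps the same filter in a hypothetical helper, `drop_sparse_columns`, that also returns the names of the dropped columns so they can be reviewed later. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nan_fraction=0.5):
    """Return (kept_frame, dropped_names) for a given NaN-fraction ceiling."""
    nan_fraction = df.isna().mean()
    keep = nan_fraction <= max_nan_fraction
    return df.loc[:, keep], list(nan_fraction.index[~keep])

toy = pd.DataFrame({
    "full": [1, 2, 3, 4],                    # 0% NaN
    "half": [1, np.nan, 3, np.nan],          # exactly 50% NaN, kept
    "sparse": [np.nan, np.nan, np.nan, 4],   # 75% NaN, dropped
})
kept, dropped = drop_sparse_columns(toy)
print(list(kept.columns), dropped)
```

Keeping the list of dropped names makes it possible to revisit the 50% cut-off later without rerunning the whole notebook.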

In [47]:
data.shape

(1534, 184)

In [47]:
#verify
print (data.isnull().mean() * 100);