## Data Wrangling

## Contents
1. Data Collection
2. Consolidating multiple data files into a single data set
3. Fill the missing values

### 1. Data Collection
Data collection was done in two steps:


**A. Get Initial Match Summary Data**

I wrote a site-scraping script to collect data from https://www.premierleague.com/ for the 2011-12 through 2018-19 seasons. The script produced one JSON file per match. Each file holds summary information about the match, with the following attributes:
1. Match ID
2. Game week
3. Date of the match
4. Teams playing in the match (home and away)
5. Attendance at the match
6. Final Score
7. Match outcome (Home Win, Away Win or Draw)
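
The match outcome in item 7 follows directly from the final score. A minimal sketch of that derivation, assuming the score is available as two integers; the function name `match_outcome` and the label strings are illustrative and are not taken from the scraping script:

```python
def match_outcome(home_goals: int, away_goals: int) -> str:
    """Classify a match from its final score (hypothetical helper)."""
    if home_goals > away_goals:
        return "Home Win"
    if home_goals < away_goals:
        return "Away Win"
    return "Draw"


print(match_outcome(2, 1))  # Home Win
```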

**B. Get the detailed statistics for every match**

I wrote a separate Python script to get additional statistics for each match:
1. Formation used
2. Total Passes
3. Total distance covered by team
4. Total clearances
5. Touches made
6. Possession
7. Corners taken
8. Chances created
9. Yellow cards
10. Goals by player role
11. Player name and profile

The Python script <a href='https://github.com/vikashya/SA/blob/master/Capstone-Project-1/scripts/data_extraction_premier_league.py'>data_extraction_premier_league.py</a> collects the additional statistics listed above. It combines the data for individual matches into one data set per season.
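
As a rough illustration of the combining step, the sketch below merges per-match records into one season-level JSON structure. The layout (string index keys, each holding a one-element list) is inferred from how the season file is read later in this notebook (`j['0'][0]`). The function name `combine_matches` and the sample field names are assumptions, not the actual code in `data_extraction_premier_league.py`.

```python
import json


def combine_matches(match_records):
    """Key each match record by its position, wrapped in a one-element list."""
    return {str(i): [record] for i, record in enumerate(match_records)}


# Hypothetical per-match records; real records carry many more attributes.
matches = [
    {"match_id": 1, "home_team": "Team A", "away_team": "Team B"},
    {"match_id": 2, "home_team": "Team C", "away_team": "Team D"},
]

season = combine_matches(matches)
# JSON text that could be written to a season-level file
season_json = json.dumps(season, indent=2)
```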

### 2. Consolidating multiple data files into a single data set

The data extracted in step 1 is organized per season. To perform the exploratory data analysis, the individual season files, which are in JSON, need to be consolidated into one single file. For this project, the consolidated file was created in TSV format.

The code below takes the JSON data for one season and creates a TSV file:

In [5]:
import json
import csv

base_dir = 'data/2018-19/'
input_file_name = base_dir + '2018-19-premier_league_data.json'
output_file_name = base_dir + '2018-19-premier_league_data.tsv'

# Load the season file: a dict whose values are one-element lists of match records
with open(input_file_name) as json_file:
    j = json.load(json_file)

# newline='' lets the csv module control line endings; the with block closes the file
with open(output_file_name, 'w', newline='') as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t')
    # The header row comes from the keys of the first match record
    tsv_writer.writerow(j['0'][0].keys())
    for key, values in j.items():
        tsv_writer.writerow(values[0].values())
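
The conversion above handles a single season. One possible way to stack the per-season TSV files into a single data set is sketched here with `pandas.concat`. The in-memory `StringIO` objects stand in for the real files, and the `match_id` and `season` columns are hypothetical names used only for illustration:

```python
import io

import pandas as pd

# Stand-ins for per-season TSV files; in practice these would be file paths
season_files = {
    "2017-18": io.StringIO("match_id\tsaves_h\n1\t3\n2\t5\n"),
    "2018-19": io.StringIO("match_id\tsaves_h\n3\t2\n"),
}

frames = []
for season, source in season_files.items():
    frame = pd.read_csv(source, delimiter="\t")
    frame["season"] = season  # hypothetical column recording where each row came from
    frames.append(frame)

# Stack all seasons and renumber the rows from 0
consolidated = pd.concat(frames, ignore_index=True)

# With no path argument, to_csv returns the TSV content as a string
tsv_text = consolidated.to_csv(sep="\t", index=False)
```

Passing a file path as the first argument to `to_csv` would write the consolidated file to disk instead of returning a string.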

### 3. Fill the missing values

The extracted data is fairly clean and needs no further cleansing beyond filling in the missing values in the columns used for the exploratory data analysis.
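
One way to find out which columns contain missing values is to count them per column. The sketch below uses a tiny stand-in DataFrame; the column names mirror the data set, but the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; the values are invented for this example
sample = pd.DataFrame({
    "saves_h": [3, np.nan, 5],
    "saves_a": [2, 4, 1],
    "tot_offside_h": [np.nan, np.nan, 1],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Columns that have at least one missing value
columns_to_fill = missing_counts[missing_counts > 0].index.tolist()
print(columns_to_fill)
```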

**A. Import the required libraries**

In [9]:
import pandas as pd
import chardet
import collections
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 800)

**B. Load the data**

In [10]:
with open('data/premier_league_data_consolidated.tsv', 'rb') as f:
    result = chardet.detect(f.read()) 

**C. Fill the missing values in the DataFrame**

In [13]:
df_original = pd.read_csv('data/premier_league_data_consolidated.tsv', delimiter='\t', encoding=result['encoding'])
# Note: this binds a second name to the same DataFrame; it is not a copy
df_completeset_converted = df_original

# Statistics whose missing values are replaced with 0
stats_to_fill = [
    'tot_offside', 'shot_off_target', 'corner_taken', 'interception',
    'first_half_goals', 'goals_openplay', 'big_chance_created',
    'big_chance_scored', 'big_chance_missed', 'own_goal_accrued',
    'saves', 'tot_yel_card', 'forward_goals', 'defender_goals',
    'midfielder_goals', 'subs_made',
]

# Each statistic has a home (_h) and an away (_a) column
for stat in stats_to_fill:
    for side in ('h', 'a'):
        column = stat + '_' + side
        df_completeset_converted[column] = df_completeset_converted[column].fillna(0)
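
A quick sanity check after filling is to confirm that no missing values remain in the affected columns. A minimal sketch on a stand-in DataFrame with made-up values; the same check could be applied to the real columns:

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few missing values
sample = pd.DataFrame({
    "saves_h": [3, np.nan],
    "saves_a": [np.nan, 4],
    "subs_made_h": [np.nan, 3],
})

filled_columns = ["saves_h", "saves_a", "subs_made_h"]
sample[filled_columns] = sample[filled_columns].fillna(0)

# Total number of missing values left across the filled columns
remaining = int(sample[filled_columns].isna().sum().sum())
assert remaining == 0, "some filled columns still contain missing values"
```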

We will use this data set in the subsequent exploratory data analysis.