# What is Exploratory Data Analysis [ Vikram] 

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

It can be broadly classified as a combination of

    * Descriptive Statistics (which describe the data itself)

    * Inferential Statistics (what can be inferred from the data)

    * Visualizations of the data
    
    


### What are we doing here

With Wimbledon about to start, we picked a data set of tennis match statistics, and we use it to analyze how results relate to various factors:

    * Do the points won on the first serve matter a lot for wins?
    * How do unforced errors impact wins?
    * How do double faults impact the game?
    * Does serving fast win a player more points?

From the data available on the web, are we able to infer something?

There are countless ways to explore the data; for the sake of this presentation we will concentrate on some of the factors mentioned above.



## Descriptive Statistics

### Importing the Libraries  

In [None]:
import numpy as np
import pandas as pd 
from scipy import stats as st
from quickda.explore_categoric import *
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats import weightstats as stest
from statsmodels.formula.api import ols # building the model
from statsmodels.stats.anova import anova_lm # the actual hypothesis test (ANOVA table)
import statsmodels.stats.multicomp as mc
import random

# Vikram

### Let us Read the Data into a Pandas DataFrame and take a peek at the data

In [None]:
df = pd.read_csv(r'C:\Users\1026774\Desktop\M.Tech\Statistics\Presentation\Tennis Data.csv')

In [None]:
df.head() # Display the first 5 rows of the data (the default for head())

In [None]:
df.shape # Get the number of rows (observations) and columns in the data
# Looking at the data we can see that there are 86394 rows and 39 columns

### Standardize Data 

The above data set has irregular column names so let us standardize column names

In [None]:
df1 = clean(data = df, method='standardize') # With this we standardize column names for ease of use
df1.head(2)

### Summarize the Data

In [None]:
df_sum = explore(data=df1, method="summarize")
df_sum.T

Looking at the above, we can see that the data set has many null values; the count of null values per column is shown in the null_sum column.

We can also see that the distributions of some columns, such as aces, aces on first serve, and average first and second serve speed, are right skewed.
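Skewness can also be checked numerically rather than read off the summary table. The sketch below uses a small made-up series (not the tennis data) and pandas' `Series.skew()`. A positive value indicates a right-skewed distribution, and for such data the mean typically sits above the median.

```python
import pandas as pd

# Hypothetical ace counts, invented for illustration: mostly small values, a few large ones
aces = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 9, 15, 22])

skewness = aces.skew()                     # sample skewness; > 0 means a longer right tail
mean, median = aces.mean(), aces.median()  # the long right tail pulls the mean upwards

print(f"skewness = {skewness:.2f}")
print(f"mean = {mean:.2f}, median = {median:.2f}")
```

On the real data the same call would be something like `df1['aces'].skew()`, assuming the standardized column is named `aces`.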

Now let us see if we can move forward with cleaning the data

### Remove duplicate rows

In [None]:
df2 =clean(data=df1, method="duplicates")
df2.head(2)
# With the above command we are removing duplicate rows if any in the data set

###  Replace value with nulls/new value

Some columns contain empty strings; we replace them with NaN so that they are treated as missing values in the steps that follow.

In [None]:
df3 = clean(data=df2, method="replaceval", 
      columns=[], to_replace="", value=np.nan)
df3.head(2)

### Fill Missing Values

In [None]:
df4 = clean(data=df3, method="fillmissing")
df4.head(2)

# This method uses pandas interpolation 
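As an illustration of what interpolation-based filling does, the sketch below applies pandas' `Series.interpolate()` (linear by default) to a toy series. It is an assumption that quickda's `fillmissing` behaves the same way for every column type; the example only shows the underlying pandas behaviour.

```python
import numpy as np
import pandas as pd

# Toy series with one leading gap and several interior gaps (values invented for illustration)
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 16.0, np.nan, 20.0])

# Default method="linear": fill each gap along a straight line between its neighbours
filled = s.interpolate()
print(filled.tolist())

# The leading NaN has no earlier value to interpolate from, so it stays missing
remaining = filled.isna().sum()
print("still missing:", remaining)
```

Gaps that interpolation cannot fill are one reason a drop step can still be needed afterwards.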

### Drop Missing Values

For the sake of this analysis we decided to drop the missing values 

In [None]:
df5 = clean(data = df4, method='dropmissing')


### Profile the cleaned Data

In [None]:
profile = explore(data = df5, method='profile', report_name='Tennis_Data_Report')

In [None]:
profile

### Correlation Matrix

In [None]:
eda_num(data = df5, method='correlation')

From the above we can identify the correlations between columns.

Some observations

    * points_played_2nd_serve is highly correlated with points won, indicating that winning these points is essential to move forward in the game
    * Unforced errors lead to points lost
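To make the idea of a correlation matrix concrete, here is a minimal sketch using pandas' `DataFrame.corr()` (Pearson by default) on synthetic data. The column names and the linear relationship are invented for illustration and are not taken from the tennis data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

points_played = rng.integers(40, 120, size=200)
# Hypothetical relationship: roughly 60 % of the points played are won, plus noise
points_won = 0.6 * points_played + rng.normal(0, 5, size=200)
# A column generated independently of the other two
unrelated = rng.normal(0, 1, size=200)

toy = pd.DataFrame({
    "points_played": points_played,
    "points_won": points_won,
    "unrelated": unrelated,
})

corr = toy.corr()  # pairwise Pearson correlation coefficients, each in [-1, 1]
print(corr.round(2))
```

A coefficient near 1 or -1 indicates a strong linear association, and a value near 0 indicates none. Correlation alone does not establish which variable drives the other.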
 

### Explore Categoric Features


In [None]:
eda_cat(data=df5, x='gender', y = 'aces', method='default')

Looking at the above, we can see that males serve more aces than females.

In [None]:
eda_cat(data=df5, x='gender', y = 'double_faults', method='default')

Looking at the above, we can see that females fare better on serve, here measured by double faults.

### Predictive Power Score

In [None]:
eda_numcat(df5, x="unforced_errors", method='pps')

From the above we can see that unforced errors have a direct impact on the points lost.
So a player who is able to control unforced errors will have an advantage in the game.

It is also interesting to note that the number of aces served remains almost the same across years.

### Visualizations 

#### Visualization of Variables

In [None]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

In [None]:
df5.to_csv(r'C:\Users\1026774\Desktop\M.Tech\Statistics\Presentation\clean_data.csv') # Write the cleaned data to disk; to_csv returns None when given a path, so there is nothing to assign

In [None]:
df_vis = AV.AutoViz(r'C:\Users\1026774\Desktop\M.Tech\Statistics\Presentation\clean_data.csv',verbose=2)

#### Visualizations of the Data Set

##### Tournament Aces [ Ravi]

In [None]:
sns.barplot(data = df5, x = 'tournament', y = 'aces' )

It's interesting to find that, of all the Grand Slams, the average number of aces served is highest at Wimbledon.
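Note that `sns.barplot` draws the mean of `y` for each category by default, so the bar heights correspond to a group-wise mean rather than a total. The toy frame below (invented values, not the tennis data) shows how the two can tell different stories.

```python
import pandas as pd

# Hypothetical rows: tournament B has more rows, but the same average
toy = pd.DataFrame({
    "tournament": ["A", "A", "B", "B", "B"],
    "aces": [4, 6, 10, 2, 3],
})

mean_aces = toy.groupby("tournament")["aces"].mean()   # what the default barplot shows
total_aces = toy.groupby("tournament")["aces"].sum()   # what a "total aces" claim would need

print(mean_aces)
print(total_aces)
```

Here both tournaments have the same mean even though B has the larger total.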

##### Unforced Errors at player level [Krushna]

In [None]:
sns.barplot(data = df5, x = 'seed', y = 'unforced_errors', hue = 'handed')

Non-seeded players commit more unforced errors than seeded players.
Also, the playing hand does not appear to have any impact on unforced errors.

##### Aces in Break Points [Mallkarjun]

In [None]:
sns.barplot(data = df5, x = 'aces_1st_serve', y = 'breakpoints_won', hue = 'gender')

Looking at the above we can infer women fare better in serving aces when crucial break points are concerned. 

##### Who Serves the fastest [Nitheen]

In [None]:
sns.barplot( data = df5 , x = 'gender', y = 'fastest_1st_serve_mph')

Looking at the above, we can infer that men serve faster than women.

##### Who receives second serves better [ Vikram]

In [None]:
sns.barplot( data = df5, x = 'return_points_won_2nd_serve', y = 'gender')

The above graph indicates women return second serves better than men

## Inferential Statistics

For this presentation on Inferential Statistics we will be using only the Wimbledon matches in the dataset

In [None]:
is_wimbledon = df5.tournament == 'Wimbledon' # Boolean mask that is True only for Wimbledon matches

In [None]:
df_wim = df5.loc[is_wimbledon] # Name the filtered DataFrame the Wimbledon DataFrame

In [None]:
df_wim.head(2)
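When filtering rows, it helps to keep in mind that `np.where` returns positions while `.loc` selects by index label. The two only coincide when the index is the default 0..n-1 range, which may no longer hold after rows have been dropped. A minimal sketch on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"tournament": ["Wimbledon", "US Open", "Wimbledon", "French Open"]},
    index=[10, 11, 12, 13],  # non-default labels, as can happen after dropping rows
)

mask = toy.tournament == "Wimbledon"

by_mask = toy.loc[mask]              # boolean mask: aligned on labels, always safe
positions = np.where(mask)[0]        # integer positions 0 and 2, not labels
by_position = toy.iloc[positions]    # positions must go through .iloc, not .loc

print(by_mask)
```

Both selections return the rows labelled 10 and 12. Passing `positions` to `.loc` would instead look for labels 0 and 2, which do not exist here.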

### Rolex claims Average Serving speed of Roger Federer is more than 148 mph [Nitheen /Vikram]

H0 : Average serving speed <= 148 mph

H1 : Average serving speed > 148 mph


In [None]:
rf = df_wim.loc[df_wim.player=='Roger FEDERER']


In [None]:
sample_rf = rf.average_1st_serve_mph.sample(n = 50, random_state = 1)
# Fixing random_state selects the same 50 observations on every run

In [None]:
sample_rf.mean()
# Obtaining the sample mean

Since the population variance is unknown we use the t statistic, and since alpha is not given we use a 95 % confidence level (alpha = 0.05). As n > 30, we invoke the CLT and treat the sampling distribution of the mean as approximately normal.

In [None]:
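t_val = st.t.isf(0.05, df = len(sample_rf) - 1) # Right-tailed critical value of the t distribution with n - 1 degrees of freedom
t_val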

Since this is a right-tailed test, we reject the null hypothesis if t_stat is greater than 1.67, which is equivalent to a one-sided p value < 0.05

In [None]:
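st.ttest_1samp(sample_rf, popmean=148, alternative='greater') # One-sided (right-tailed) test; the alternative argument needs SciPy 1.6 or later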

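Since the test stat is < 1.67, it does not fall in the rejection region of this right-tailed test, so we fail to reject the null hypothesis: the sample does not support the claim that Roger's average serving speed is more than 148 mph. A small two-sided p value together with a test stat below the critical value would only indicate that the mean differs from 148 mph, not that it exceeds it.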
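The mechanics of a right-tailed one-sample t test can be sketched as follows. The speeds are synthetic numbers generated for illustration (not Federer's data), and the sketch assumes SciPy 1.6 or later for the `alternative` argument. The statistic is the sample mean minus the hypothesised mean, divided by the standard error s / sqrt(n).

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(1)
# Synthetic serve speeds in mph, invented for illustration
speeds = rng.normal(loc=150, scale=6, size=50)

mu0 = 148                                   # hypothesised mean under H0
n = len(speeds)
std_err = speeds.std(ddof=1) / np.sqrt(n)   # standard error from the sample standard deviation
t_manual = (speeds.mean() - mu0) / std_err

t_crit = st.t.isf(0.05, df=n - 1)           # right-tail critical value, about 1.68 for df = 49
p_manual = st.t.sf(t_manual, df=n - 1)      # one-sided p-value: P(T > t_manual)

# SciPy computes the same statistic and p-value in one call
result = st.ttest_1samp(speeds, popmean=mu0, alternative="greater")

reject_h0 = t_manual > t_crit               # same decision as p_manual < 0.05
print(t_manual, t_crit, p_manual, result.pvalue, reject_h0)
```

Comparing the statistic with the critical value and comparing the one-sided p-value with alpha always lead to the same decision, so only one of the two checks is needed.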

### Do men serve more double faults than women [ Ravi]

In [None]:
eda_cat(data = df_wim,x = 'gender',y = 'double_faults' )

The above is clearly indicating that men make more double faults than women

### How does gender fare across Service winners [ KK]

In [None]:
eda_cat(data = df_wim,x = 'gender',y = 'service_winners' )

When it comes to service winners, men fare better

### Who conquers more break points [ Mallikarjun]

In [None]:
sns.barplot(data = df_wim, x = 'breakpoints_played', y = 'breakpoints_won',hue='gender')

It's interesting once again to note that women conquer more break points than men, meaning women handle breaks very well :)

### Hypothesis Testing [ Vikram]



H0 : Winning points on the first/second serve does not impact the match result

H1 : Winning points on the first/second serve impacts the match result

Since sigma is unknown, we will use the t test at the 95 % confidence level


Rationale:

The reason for the advantage, if it exists, would be that the player who receives in the first game is usually one game behind and that this would create extra stress. Let us investigate whether there is any truth in this hypothesis.

In [None]:
x = df_wim.points_played_1st_serve
y = df_wim.points_won_1st_serve
xbar1 = (y/x).mean() # Mean fraction of first serve points won

In [None]:
xbar1

In [None]:
x1 = df_wim.points_played_2nd_serve
y1 = df_wim.points_won_2nd_serve
xbar2 = (y1/x1).mean() # Mean fraction of second serve points won

In [None]:
std1 = (y/x).std() # Std dev of 1st serve wins

In [None]:
std2 = (y1/x1).std() # Std dev of 2nd serve wins

In [None]:
len1 = len(df_wim.points_played_1st_serve) # Len of observations
len2 = len(df_wim.points_played_2nd_serve)


Now let us analyze this statistically 

In [None]:
st.ttest_ind_from_stats(xbar1,std1,len1,xbar2,std2,len2) 

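Since the p value is effectively 0, the difference between the two means is statistically significant: the share of points won on the first serve differs from the share won on the second serve. Note that this test compares the two serve win rates with each other; on its own it does not measure their effect on the match result.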
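As a sanity check on the summary-statistics form of the test, the sketch below shows on synthetic data that `st.ttest_ind_from_stats` reproduces `st.ttest_ind` on the raw samples, provided the standard deviations are sample standard deviations (`ddof=1`, which is what pandas' `.std()` returns). The win fractions are invented for illustration. Because both fractions in the tennis data come from the same row, a paired test (`st.ttest_rel`) may be the more natural choice; that is a suggestion, not something the original analysis did.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(2)
# Synthetic per-match win fractions, invented for illustration
first_serve = rng.normal(0.70, 0.08, size=300)
second_serve = rng.normal(0.50, 0.10, size=300)

# Two-sample t test from summary statistics (mean, sample std, sample size)
from_stats = st.ttest_ind_from_stats(
    first_serve.mean(), first_serve.std(ddof=1), len(first_serve),
    second_serve.mean(), second_serve.std(ddof=1), len(second_serve),
)

# The same test computed directly from the raw samples
from_raw = st.ttest_ind(first_serve, second_serve)

# Paired alternative: treats element i of both arrays as the same match
paired = st.ttest_rel(first_serve, second_serve)

print(from_stats.statistic, from_raw.statistic, paired.statistic)
```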

In [None]:
# Function to plot points won on first serve against the match result at Wimbledon, split by set
def points_in_first_serve(df_wim):
    # barplot aggregates the rows itself, so no intermediate variables are needed
    return sns.barplot(data = df_wim, x = 'points_won_1st_serve', y = 'match_won/lost',
                       hue = 'set')

    
    

In [None]:
points_in_first_serve(df_wim)

In [None]:
# Function to plot points won on second serve against the match result at Wimbledon, split by set
def points_in_second_serve(df_wim):
    return sns.barplot(data = df_wim, x = 'points_won_2nd_serve', y = 'match_won/lost', hue = 'set')

In [None]:
points_in_second_serve(df_wim)

### Does player hand have significance on winning [ Ravi/KK]

H0 : Playing hand does not impact match winnings

H1 : Playing hand impacts match wins

In [None]:
right_handed = (df_wim.handed=='Right Handed').sum() # Get the count of right handed players

In [None]:
right_handed_wins = ((df_wim.handed=='Right Handed')&(df_wim['match_won/lost']=='Won')).sum()
# Get the count of right hand wins

In [None]:
rhw_ratio = right_handed_wins/right_handed
print(rhw_ratio)
# Compute the share of right handed players' matches that were won

In [None]:
left_handed = (df_wim.handed=='Left Handed').sum() # Get the count of left handed players
left_handed_wins = ((df_wim.handed=='Left Handed')&(df_wim['match_won/lost']=='Won')).sum()
# Get the count of left hand wins

In [None]:
lhw_ratio = left_handed_wins/left_handed
print(lhw_ratio)
# Compute the share of left handed players' matches that were won

In [None]:
st.chi2.isf(0.05, df = 1) # Critical value of the chi-square (chi2) distribution, not the chi distribution

# We reject the null hypothesis if the test stat is > 3.84

In [None]:
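# chi2_contingency expects observed counts, not ratios: rows = playing hand, columns = match result
observed = pd.crosstab(df_wim.handed, df_wim['match_won/lost'])
stat, p_val, dof, exp_val = st.chi2_contingency(observed)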

In [None]:
stat,p_val, dof,exp_val

Since the P Value is greater than 0.05 we fail to reject the null hypothesis and conclude the playing hand does not matter in winning or losing the games
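A chi-square test of independence works on a table of observed counts. The sketch below builds such a table with `pd.crosstab` from hypothetical records (the counts are invented and do not come from the tennis data) and compares the statistic with the critical value for the resulting degrees of freedom.

```python
import pandas as pd
from scipy import stats as st

# Hypothetical player-match records, invented for illustration
toy = pd.DataFrame({
    "handed": ["Right Handed"] * 100 + ["Left Handed"] * 20,
    "result": ["Won"] * 52 + ["Lost"] * 48 + ["Won"] * 9 + ["Lost"] * 11,
})

# Contingency table of counts: rows = playing hand, columns = match result
observed = pd.crosstab(toy["handed"], toy["result"])

# Returns the statistic, the p-value, the degrees of freedom and the expected counts under H0
stat, p_val, dof, expected = st.chi2_contingency(observed)

critical = st.chi2.isf(0.05, df=dof)   # about 3.84 for a 2 x 2 table (dof = 1)
reject_h0 = stat > critical            # same decision as p_val < 0.05

print(observed)
print(stat, p_val, dof, critical, reject_h0)
```

For a table with r rows and c columns the degrees of freedom are (r - 1)(c - 1), so the critical value should be taken from `dof` rather than hard-coded.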

In [None]:
eda_cat(data = df_wim, x = 'handed', y = 'match_won/lost')

## Simulations [ Nitheen, Mallikarjun ] 

With the existing Data Set

### Central Limit Theorem - Demo

    * The Central Limit Theorem states that the sampling distribution of the mean of independent random variables with finite variance will be normal or nearly normal if the sample size is large enough.


    * In other words, if we take enough random samples that are big enough, the means of all the samples will be approximately normally distributed around the actual mean of the population. Note that the underlying population distribution does not have to be normal for the CLT to apply. To break this down even further, imagine collecting a sample and calculating the sample mean. Repeat this over and over again, collecting a new, independent sample from the population each time. If we plotted a histogram of these sample means, it would look approximately normal.
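The statement above can be sketched with a simulation on a synthetic population, here an exponential distribution, which is strongly right skewed. The population, sample size and number of repetitions are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(3)
# Right-skewed synthetic population with mean 1 and standard deviation 1
population = rng.exponential(scale=1.0, size=100_000)

n, repeats = 50, 2000
# Each row is one sample of size n drawn with replacement; take the mean of every row
sample_means = rng.choice(population, size=(repeats, n), replace=True).mean(axis=1)

print("skew of population:  ", st.skew(population))
print("skew of sample means:", st.skew(sample_means))
print("std of sample means: ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", population.std() / np.sqrt(n))
```

The sample means are far less skewed than the population, they centre on the population mean, and their spread is close to sigma / sqrt(n), as the theorem predicts.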

In [None]:
st.shapiro(df_wim.aces)
# P Value is < 0.05 and we can conclude that the data is not normally distributed

In [None]:
sns.distplot(df_wim.aces)
# Distribution plot for above

The above distribution for aces is Right Skewed 

In [None]:
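random.seed(2) # Seed Python's random module so that the resampling is reproducible
samples = np.array(df_wim.aces)
samples_2 = [np.mean(random.choices(samples, k = 30)) for _ in range(20)] # 20 sample means, each from 30 draws with replacement
sns.distplot(samples_2)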

In [None]:
st.shapiro(samples_2)
# P Value greater than 0.05 and the shape of the curve is nearing a normally distributed curve

In [None]:
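random.seed(2) # Seed Python's random module so that the resampling is reproducible
samples = np.array(df_wim.aces)
samples_2 = [np.mean(random.choices(samples, k = 40)) for _ in range(30)] # 30 sample means, each from 40 draws with replacement
sns.distplot(samples_2)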


In [None]:
st.shapiro(samples_2)

# P Value is greater than 0.05 and the data looks nearly normally distributed

In [None]:
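random.seed(2) # Seed Python's random module so that the resampling is reproducible
samples = np.array(df_wim.aces)
samples_500 = [np.mean(random.choices(samples, k = 50)) for _ in range(500)] # 500 sample means, each from 50 draws with replacement
sns.distplot(samples_500)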

For the means of 500 samples of size 50, the distribution looks almost normal

In [None]:
st.shapiro(samples_500)
# P Value is greater than 0.05 and the curve almost looks like  normally distributed

## Further Scope for Analysis of the Data [ Vikram]

The size of the data set opens scope to investigate a few more hypotheses, viz.

    * In the final set the player who has won the previous set has the advantage
    * The real champions play their best tennis at the big points
    * All points are equally important
    * After missing break points in the previous game there is an increased chance that you will lose your own service
    * One break is enough to win the set

#### What went well [ KK]

The project/presentation provided an opportunity to use some of the concepts which were learnt as part of the curriculum.

This also provided an opportunity to explore many of the scipy.stats modules and how their outputs can be used to draw meaningful inferences.



#### What could have been done better [ Ravi, Mallikarjun, Nitheen]

The variables in the data set did not follow a normal distribution, and the Shapiro and Levene tests were failing, so there was some uncertainty about whether the same data set could be used to perform ANOVA tests.
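The two assumption checks mentioned here can be sketched with `scipy.stats` on synthetic groups. The data below are generated to violate each assumption on purpose, so that the rejections are easy to see; they are not the tennis data.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(4)
skewed = rng.exponential(scale=2.0, size=200)   # clearly non-normal sample
narrow = rng.normal(0, 1, size=200)             # group with small variance
wide = rng.normal(0, 5, size=200)               # group with large variance

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
shapiro_stat, shapiro_p = st.shapiro(skewed)

# Levene: H0 = the groups have equal variances
levene_stat, levene_p = st.levene(narrow, wide)

print("Shapiro p value:", shapiro_p)
print("Levene p value: ", levene_p)
```

A p value below 0.05 in either test means the corresponding ANOVA assumption (normality or equal variances) is rejected for that data.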

The data set provides an opportunity to explore some questions like

    * How does gender fare in terms of fastest serve versus matches won (meaning, does serving fast help win more points, which in turn helps win matches)?
    * Do seeded players experiment with more varieties of shots than non-seeded players, and do they make more errors because of it?
    * How can simulations be applied to the current data set?

Further research and a deeper understanding are needed to answer some of the above questions