# Data Analysis with Python

<h3>By Priyanka Roy</h3>

<h3>Objectives: </h3>


<h4> Data Analysis:</h4>

- Using Python libraries for data analysis (pandas and NumPy)
- Reading different file types (.csv, .xlsx, etc.) into a pandas DataFrame <br>

<h4>Data Manipulation: </h4>

- Create and name a new DataFrame in pandas
- Find the number of rows and columns in the dataframe
- Find the data statistics of the dataset
- Find the data types and missing values
- Rename the column names
- Remove unnecessary columns

<h4> Reference(s): </h4>   <br> 
[1] <a href="https://stats.espncricinfo.com/ci/content/records/93276.html"> Actual Dataset Source </a> <br>

[2] <a href="https://github.com/priyan-2020/Test_Cricket_Analysis/blob/main/wickets.csv"> wickets.csv file </a>

<h4>Import required libraries </h4>

In [1]:
import pandas as pd
import numpy as np

<h4>Reading the wickets.csv file and showing the first 10 rows of the Dataframe </h4>

In [2]:
df = pd.read_csv("wickets.csv")

df.head(10)

Unnamed: 0,Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
5,SCJ Broad (ENG),2007-2021,149,274,29863,14590,524,8/15,11/121,27.84,2.93,56.9,18,3
6,CA Walsh (WI),1984-2001,132,242,30019,12688,519,7/37,13/55,24.44,2.53,57.8,22,3
7,DW Steyn (SA),2004-2019,93,171,18608,10077,439,7/51,11/60,22.95,3.24,42.3,26,5
8,N Kapil Dev (INDIA),1978-1994,131,227,27740,12867,434,9/83,11/146,29.64,2.78,63.9,23,2
9,HMRKB Herath (SL),1999-2018,93,170,25993,12157,433,9/127,14/184,28.07,2.8,60.0,34,9
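
To follow along without downloading wickets.csv, the same call can be tried on an in-memory sample. This is a minimal sketch that assumes only the first three rows shown above; `sample_csv` and `sample_df` are hypothetical names that are not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for wickets.csv (first three rows only).
sample_csv = """Player,Span,Mat,Inns,Balls,Runs,Wkts,BBI,BBM,Ave,Econ,SR,5,10
M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(sample_csv))

# head(n) returns the first n rows; with no argument it returns the first 5.
print(sample_df.head(2))
```

Reading from a string buffer keeps the example self-contained; with the real file, the path "wickets.csv" takes the place of the buffer.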


<h4> Findings: </h4> <br>
In the output above, the first column is the row index. After it, the DataFrame, df, has 14 columns:
 
 * <b>Player:</b> Name of the player 
 * <b>Span:</b> Duration of Test Career 
 * <b>Mat:</b> No. of matches played 
 * <b>Inns:</b> No. of innings played 
 * <b>Balls:</b> Total no. of balls bowled 
 * <b>Runs:</b> No. of runs conceded 
 * <b>Wkts:</b> No. of wickets taken altogether 
 * <b>BBI:</b> Best bowling figures in an innings 
 * <b>BBM:</b> Best bowling figures in a match (2 innings) 
 * <b>Ave:</b> Bowling average, the average no. of runs conceded per wicket 
 * <b>Econ:</b> Economy Rate, Econ = (Total runs conceded)/(Total overs bowled)
 * <b>SR:</b> Strike Rate, the average no. of balls bowled per wicket 
 * <b>5:</b> Shows the no. of 5-wicket hauls in an innings
 * <b>10:</b> Shows the number of times this bowler has taken ten wickets in a match 
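
The three rate columns follow from Balls, Runs and Wkts. As a rough check, the sketch below recomputes them for the first row of the table. It assumes six-ball overs, which the notebook does not state, and all variable names are illustrative.

```python
# Figures for M Muralitharan, copied from the first row of the table above.
balls, runs, wickets = 44039, 18180, 800

# Assumption: every over has six balls (not stated in the notebook).
BALLS_PER_OVER = 6

average = runs / wickets                    # runs conceded per wicket
economy = runs / (balls / BALLS_PER_OVER)   # runs conceded per over
strike_rate = balls / wickets               # balls bowled per wicket

print(f"Ave={average:.3f}, Econ={economy:.3f}, SR={strike_rate:.3f}")
```

The results should land within rounding distance of the tabulated 22.72, 2.47 and 55.0. The rounding rule used by the data source is not stated, so the last digit may differ.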

<h4> No. of Rows and Columns </h4>

In [3]:
# No. of Rows and Columns in the dataframe, df
print("(row, column) -->", df.shape)

(row, column) --> (79, 14)


<h4>Remarks: </h4>

*   The DataFrame has 79 rows and 14 columns. Each row holds the Test-match bowling record of one bowler. <br> <br>
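
Because `shape` is a plain tuple, the two counts can be unpacked into separate variables. The sketch below uses a hypothetical two-row frame, `demo`, standing in for df.

```python
import pandas as pd

# Hypothetical two-row frame standing in for df.
demo = pd.DataFrame({"Player": ["A", "B"], "Wkts": [800, 708]})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")

# len() on a DataFrame gives the row count only.
row_count = len(demo)
```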

<h4>Check for Missing Values</h4>

In [4]:
# DataFrame info: data types and non-null counts.
# info() prints its report itself and returns None, so no print() is needed.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     object 
 3   Inns    79 non-null     int64  
 4   Balls   79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   Wkts    79 non-null     int64  
 7   BBI     79 non-null     object 
 8   BBM     79 non-null     object 
 9   Ave     79 non-null     float64
 10  Econ    79 non-null     float64
 11  SR      79 non-null     float64
 12  5       79 non-null     int64  
 13  10      79 non-null     int64  
dtypes: float64(3), int64(6), object(5)
memory usage: 8.8+ KB


<h4>Remarks:</h4>

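* There are 79 observations (players) and 14 variables (features).
* 'Ave', 'Econ', 'SR' : floating-point values (Average, Economy and Strike Rate). <br>
* 'Inns', 'Balls', 'Runs', 'Wkts', '5', '10' : integer values. <br>
* 'Player', 'Span', 'Mat', 'BBI', 'BBM' : object (string/mixed data type). 'Mat' is read as object rather than integer because at least one entry, such as '164*', contains a non-numeric character. <br>
* No missing values: every column has 79 non-null entries. <br>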
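
The missing-value check can also be made explicit, and a column such as 'Mat' can be turned into numbers once the stray character is removed. This is a minimal sketch on a hypothetical three-row sample, `demo`. It assumes the trailing asterisk, whose meaning the notebook does not state, can simply be dropped.

```python
import pandas as pd

# Hypothetical sample; 'Mat' values copied from the first rows shown earlier.
demo = pd.DataFrame({
    "Player": ["M Muralitharan (ICC/SL)", "SK Warne (AUS)", "JM Anderson (ENG)"],
    "Mat": ["133", "145", "164*"],
})

# Count missing values per column (all zeros means nothing is missing).
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Assumption: the asterisk carries no numeric meaning and can be stripped.
demo["Mat_numeric"] = pd.to_numeric(demo["Mat"].str.rstrip("*"))
print(demo.dtypes)
```

Writing the result to a new column keeps the original text values available for comparison.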

<h4>Data Statistics</h4>

In [5]:
# Checking the data statistics
display(df.describe())

Unnamed: 0,Inns,Balls,Runs,Wkts,Ave,Econ,SR,5,10
count,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0,79.0
mean,144.911392,18638.35443,8599.35443,317.21519,27.469747,2.806835,59.193671,16.35443,2.797468
std,51.180222,7199.256972,3085.168807,121.924911,3.655658,0.351577,9.350132,9.642372,3.235935
min,67.0,8785.0,4846.0,200.0,20.94,1.98,41.2,3.0,0.0
25%,110.0,13583.0,6456.5,229.0,24.5,2.6,53.3,9.5,1.0
50%,129.0,16498.0,7742.0,266.0,28.0,2.82,57.4,14.0,2.0
75%,169.0,21742.5,9756.0,374.5,29.87,3.08,63.95,20.5,3.5
max,304.0,44039.0,18355.0,800.0,34.79,3.46,91.9,67.0,22.0


<h4>Remarks: </h4>



 * The mean number of innings per bowler is about 145, with a minimum of 67.
 * The mean number of runs conceded per bowler is about 8599, with a minimum of 4846.
 * The minimum of 200 in 'Wkts' indicates that the dataset covers bowlers who have taken at least 200 Test wickets. <br> <br>
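
Individual figures such as the mean and minimum can be read out of the `describe()` result by label. The sketch below uses a hypothetical five-row sample, `demo`, built from the 'Inns' and 'Runs' values of the first five rows shown earlier, so its numbers differ from the full 79-row summary.

```python
import pandas as pd

# Hypothetical sample: 'Inns' and 'Runs' from the first five rows shown earlier.
demo = pd.DataFrame({
    "Inns": [230, 273, 304, 236, 243],
    "Runs": [18180, 17995, 16575, 18355, 12186],
})

# describe() returns a DataFrame indexed by statistic name.
stats = demo.describe()

# Select only the statistics and columns of interest.
print(stats.loc[["mean", "min"], ["Inns", "Runs"]])

# The same values are available directly from a column.
inns_min = demo["Inns"].min()
```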
    

In [6]:
# shows the existing column headings of the dataframe

print(df.columns)

Index(['Player', 'Span', 'Mat', 'Inns', 'Balls', 'Runs', 'Wkts', 'BBI', 'BBM',
       'Ave', 'Econ', 'SR', '5', '10'],
      dtype='object')


<h4>Renaming the column names</h4>

In [7]:
# Renaming the column names

df.rename(columns={'Span': 'Career',
                   'Mat': 'Match',
                   'Inns': 'Innings',
                   'Wkts': 'Wickets',
                   'BBI': 'Best_Bowling_fig(innings)',
                   'BBM': 'Best_Bowling_fig(Match)',
                   'Ave': 'Average',
                   'Econ': 'Economy',
                   'SR': 'Strike_rate',
                   '5': 'Five_wickets_wholes',
                   '10': 'Ten_wickets_in_a_match'}, inplace=True)

In [8]:
# displaying the first 5 rows of the dataframe which now have the changed column names

df.head(5)

Unnamed: 0,Player,Career,Match,Innings,Balls,Runs,Wickets,Best_Bowling_fig(innings),Best_Bowling_fig(Match),Average,Economy,Strike_rate,Five_wickets_wholes,Ten_wickets_in_a_match
0,M Muralitharan (ICC/SL),1992-2010,133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),1992-2007,145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),2003-2021,164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),1990-2008,132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),1993-2007,124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
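
The cell above renames the columns in place. An alternative is to let `rename` return a new DataFrame and leave the original untouched. This is a minimal sketch on a hypothetical two-column frame; `demo` and `renamed` are illustrative names.

```python
import pandas as pd

# Hypothetical two-column frame standing in for df.
demo = pd.DataFrame({"Mat": ["133", "145"], "Wkts": [800, 708]})

# Without inplace=True, rename returns a new DataFrame; demo keeps its old names.
renamed = demo.rename(columns={"Mat": "Match", "Wkts": "Wickets"})

print(list(demo.columns))
print(list(renamed.columns))
```

Keeping the original frame around can make it easier to re-run a notebook cell without side effects.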


<h4>Removing a column: "Career"</h4>

In [9]:
# Removing the column 'Career' and displaying the dataframe

df.drop('Career', axis=1, inplace=True)

display(df.head())

Unnamed: 0,Player,Match,Innings,Balls,Runs,Wickets,Best_Bowling_fig(innings),Best_Bowling_fig(Match),Average,Economy,Strike_rate,Five_wickets_wholes,Ten_wickets_in_a_match
0,M Muralitharan (ICC/SL),133,230,44039,18180,800,9/51,16/220,22.72,2.47,55.0,67,22
1,SK Warne (AUS),145,273,40705,17995,708,8/71,12/128,25.41,2.65,57.4,37,10
2,JM Anderson (ENG),164*,304,35079,16575,623,7/42,11/71,26.6,2.83,56.3,30,3
3,A Kumble (INDIA),132,236,40850,18355,619,10/74,14/149,29.65,2.69,65.9,35,8
4,GD McGrath (AUS),124,243,29248,12186,563,8/24,10/27,21.64,2.49,51.9,29,3
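
Running the drop cell a second time raises a KeyError, because 'Career' no longer exists. The sketch below shows one possible way around this on a hypothetical one-row frame: the `columns=` keyword is equivalent to `axis=1`, and `errors="ignore"` skips labels that are already gone.

```python
import pandas as pd

# Hypothetical one-row frame standing in for df.
demo = pd.DataFrame({
    "Player": ["M Muralitharan (ICC/SL)"],
    "Career": ["1992-2010"],
    "Match": ["133"],
})

# columns=[...] is equivalent to passing the labels with axis=1.
trimmed = demo.drop(columns=["Career"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed_again = trimmed.drop(columns=["Career"], errors="ignore")

print(list(trimmed_again.columns))
```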
