### DOMAIN: 
Sports
### CONTEXT: 
Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in past tournaments. It records how many baskets each team scored and conceded, how many times it finished in the top two positions, how many tournaments it has played, its best position in the past, etc.
### DATA DESCRIPTION: 
Basketball.csv - The data set contains information on all the teams that have participated in the past tournaments so far.
### DATA DICTIONARY:
1. Team: Team’s name
2. Tournament: Number of played tournaments.
3. Score: Team’s score so far.
4. PlayedGames: Games played by the team so far.
5. WonGames: Games won by the team so far.
6. DrawnGames: Games drawn by the team so far.
7. LostGames: Games lost by the team so far.
8. BasketScored: Basket scored by the team so far.
9. BasketGiven: Basket scored against the team so far.
10. TournamentChampion: How many times the team was a champion of the tournaments so far.
11. Runner-up: How many times the team was a runner-up of the tournaments so far.
12. TeamLaunch: Year the team was launched in professional basketball.
13. HighestPositionHeld: Highest position held by the team amongst all the tournaments played.

### PROJECT OBJECTIVE:
Company’s management wants to invest in proposals for managing some of the best teams in the league. The analytics department has been assigned the task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence, Company X wants to understand which teams it can approach for a deal that will be a win for the company.

## 1.) Read the data set, clean the data and prepare final dataset to be used for analysis.

### Import Python Libraries

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print('Libraries imported successfully')

### Read the data set

In [8]:
df1 = pd.read_csv("../input/greatlearningaimlassignment-datasets/DS - Part2 - Basketball.csv") # Load the dataset in Pandas dataframe
df1.head() # Display first 5 rows

From the first 5 rows of the dataset, the table looks OK. However, this does not give us any indication of the datatypes.
Also, one value in the TeamLaunch column is 1931to32, which can create problems in EDA. This column needs to be checked and rectified as required.
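A quick way to find every irregular entry, rather than relying on the first 5 rows, is to flag values that are not exactly four digits. The sketch below uses a small hypothetical Series as a stand-in for the TeamLaunch column; the values in it are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the TeamLaunch column (illustrative values only)
launch = pd.Series(["1929", "1931to32", "1934-35", "1998"], name="TeamLaunch")

# A clean entry consists of exactly four digits; anything else needs attention
is_clean = launch.astype(str).str.fullmatch(r"\d{4}")
irregular = launch[~is_clean]
print(irregular.tolist())
```

On the real dataframe the same check would be applied to df1['TeamLaunch'].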

### Clean the data

In [9]:
df1.shape # Check the size of the tabular data

There are 61 rows and 13 columns in the dataframe df1.

In [10]:
df1.info() # Check basic information about the dataframe

The Team column should be the only column with the 'object' data type, as it is the only categorical variable.

However, only the Tournament and HighestPositionHeld columns have the 'int64' data type; all the other columns have been read as 'object'.

In [11]:
df1.describe(include='all') # Check basic statistics of the dataframe

Numerical statistics are displayed for only 2 columns (Tournament & HighestPositionHeld), indicating that the other columns are recognized as non-numerical by Pandas.
Most of the columns that should be numerical show NaN instead of descriptive statistics.
The TournamentChampion and Runner-up columns have '-' as their top (most frequent) value, indicating that some rows in these columns don't have numerical values.

In [12]:
df1.tail() # Display last 5 rows

It is clear from the above table that the Team 61 row and the TournamentChampion and Runner-up columns contain '-' values that need to be replaced. We can assume that '-' indicates that data is not available at these locations and that it should be replaced with 0. It would not be prudent to drop the rows and columns containing '-', as they also have valid data points.
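Since tail() only shows the last 5 rows, it can help to count the '-' placeholders in every column before replacing them. The following is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration); it assumes the placeholder is exactly the string '-'.

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw file, where numbers were read as strings
raw = pd.DataFrame({
    "Team": ["Team A", "Team B", "Team C"],
    "Score": ["40", "34", "-"],
    "TournamentChampion": ["-", "-", "-"],
})

# Count the cells that are exactly '-' in each column
dash_counts = raw.eq("-").sum()
print(dash_counts[dash_counts > 0])
```

Running raw.eq("-").sum() on df1 would list the affected columns in the actual data.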

In [13]:
df1=df1.replace('-',0) #Replace '-' with 0
df1.tail()

The TeamLaunch column contains the 'Year the team was launched in professional basketball'. This implies that each team has exactly one launch year; a team cannot be launched over a period of 2 or more years. Hence, to rectify this column we can ignore the second year that appears and treat only the first 4 characters as the year of launch of the team.

In [14]:
df1['TeamLaunch'] = df1['TeamLaunch'].str[:4] #Extract first 4 characters from TeamLaunch column as the year
df1['TeamLaunch']
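Slicing the first 4 characters works as long as every entry starts with a year. As a slightly more defensive variant, a regular expression can extract the leading four digits and expose any entry that does not start with a year as a missing value. This is only a sketch on hypothetical values, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the TeamLaunch column (illustrative values only)
launch = pd.Series(["1929", "1931to32", "1998"], name="TeamLaunch")

# Extract the leading four digits; entries without them would become NaN
year = launch.str.extract(r"^(\d{4})", expand=False)

# Confirm that every entry yielded a year before converting to integers
assert year.notna().all()
year = year.astype("int64")
print(year.tolist())
```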

### Prepare Final Dataset for Analysis

In [15]:
basketball = df1.astype("int64", errors='ignore') #Convert datatype to numerical for EDA
basketball.info() #Check info of cleaned data

This step has converted the datatype of all numerical columns to integers, since the numeric columns have integer values.
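Because errors='ignore' silently leaves a column unchanged when it cannot be converted, an explicit per-column conversion can make failures visible. The sketch below is an alternative under stated assumptions: the toy frame and its values are hypothetical, and Team is assumed to be the only non-numeric column.

```python
import pandas as pd

# Hypothetical cleaned frame: numbers still stored as strings, '-' already replaced by 0
cleaned = pd.DataFrame({
    "Team": ["Team A", "Team B"],
    "Score": ["10", "20"],
    "TournamentChampion": ["3", 0],
})

# Convert every column except Team; pd.to_numeric raises on any leftover text
numeric_cols = cleaned.columns.drop("Team")
for col in numeric_cols:
    cleaned[col] = pd.to_numeric(cleaned[col])

print(cleaned.dtypes)
```

With this approach a stray non-numeric value stops the notebook with an error instead of leaving the column as 'object'.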

In [16]:
basketball.describe() #Check statistics after correcting the datatypes

In [17]:
#Let's create few more columns for analysis purpose as the table has only one categorical variable
basketball['IfWinner']=basketball['TournamentChampion']>0 #Has the team ever been a tournament champion?
basketball['BasketDiff']=basketball['BasketScored']-basketball['BasketGiven'] #Basket Difference
basketball['NetScorer']=basketball['BasketDiff']>0 #Is the basket difference of a team positive? Scored more than conceded?
basketball['Win Percent']=round((basketball['WonGames']/basketball['PlayedGames'])*100,2) #Winning percentage of teams
basketball['Loss Percent']=round((basketball['LostGames']/basketball['PlayedGames'])*100,2) #Loss percentage of teams
basketball['Drawn Percent']=round((basketball['DrawnGames']/basketball['PlayedGames'])*100,2) #Draw percentage of teams
basketball['NetWinner']=(basketball['Win Percent']>basketball['Loss Percent']) & (basketball['Win Percent']> basketball['Drawn Percent'])#Whether the team is a net winner?
basketball.head()
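The three percentage columns only add up to 100 if won, drawn and lost games account for all played games. A simple consistency check, sketched here on hypothetical numbers, flags rows where that does not hold.

```python
import pandas as pd

# Hypothetical game counts; the second row is deliberately inconsistent
toy = pd.DataFrame({
    "PlayedGames": [10, 8],
    "WonGames": [5, 4],
    "DrawnGames": [2, 1],
    "LostGames": [3, 2],
})

# Won + drawn + lost should equal the number of games played
parts = toy[["WonGames", "DrawnGames", "LostGames"]].sum(axis=1)
mismatch = toy[parts != toy["PlayedGames"]]
print(mismatch)
```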

## 2.) Perform detailed statistical analysis and EDA using univariate, bi-variate and multivariate EDA techniques to get data driven insights on recommending which teams they can approach which will be a deal win for them. Also as a data and statistics expert you have to develop a detailed performance report using this data

In [18]:
basketball.hist(figsize=(20,30)); #Plot histogram of all numerical variables

**Observations**
Most variables appear to be right-skewed: the bulk of the values is concentrated towards the left end of the histograms, with a long tail to the right.
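The visual impression of skew can be backed with a number. pandas provides DataFrame.skew(), where a positive value indicates a longer right tail. The sketch below uses hypothetical data with one right-skewed and one symmetric column.

```python
import pandas as pd

# Hypothetical data: Score has one very large value, WonGames is symmetric
toy = pd.DataFrame({
    "Score": [10, 12, 15, 20, 400],
    "WonGames": [1, 2, 3, 4, 5],
})

# Sample skewness per column; values well above 0 suggest a right-skewed distribution
skewness = toy.skew()
print(skewness)
```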

In [19]:
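corr_mat=basketball.select_dtypes(include=['number','bool']).corr() #Correlation matrix of numeric and boolean columns (recent pandas versions raise an error on the text column Team)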
fig, ax = plt.subplots(figsize=(15,5)) #Set Figure Size
sns.heatmap(corr_mat,annot=True); #Plot heatmap of correlation matrix

In [20]:
#Which team has won most tournaments?
temp = basketball.sort_values('TournamentChampion',ascending=False).head(10)
fig, ax = plt.subplots(figsize=(15,5)) #Set Figure Size
sns.barplot(x="Team", y = 'TournamentChampion',data=temp,hue='NetScorer');

Team 1 has won the most tournaments.
Teams 8, 10 and 11 have won tournaments despite a negative basket difference, i.e. they have conceded more baskets than they have scored.

In [21]:
#Which teams have the highest scores?
temp = basketball.sort_values('Score',ascending=False).head(10)
fig, ax = plt.subplots(figsize=(15,5)) #Set Figure Size
sns.barplot(x="Team", y = 'Score',data=temp,hue='IfWinner');

Teams 7 and 9 are in the Top 10 scoring teams and yet haven't won the championship even once.

In [22]:
#Which is the oldest team?
basketball.sort_values('TeamLaunch',ascending=True).head(1)

Team 1 is the oldest team!

In [23]:
#Youngest/Newest team
basketball.sort_values('TeamLaunch',ascending=False).head(1)

Team 61 is the youngest team!

In [24]:
#Team Launch Year distribution
sns.displot(x='TeamLaunch',data=basketball)

In [25]:
#Is there a relation between number of tournaments played and tournament championships?
sns.scatterplot(data=basketball, x="Tournament", y="TournamentChampion");

Teams that have played more than ~60 tournaments have won multiple championships.

In [26]:
#Is there a relation between team launch year and tournament championships?
sns.scatterplot(data=basketball, x="TeamLaunch", y="TournamentChampion");

Older teams (launched before 1940) have won multiple championships. Newer teams haven't fared well at winning championships.
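The comparison between older and newer teams can also be summarised in a table by grouping on launch era. This is a minimal sketch on hypothetical rows; the 1940 cut-off is taken from the observation above, and the era labels are my own.

```python
import pandas as pd

# Hypothetical launch years and championship counts (illustrative values only)
toy = pd.DataFrame({
    "TeamLaunch": [1929, 1935, 1950, 1998],
    "TournamentChampion": [5, 2, 0, 0],
})

# Label each team by launch era using the 1940 cut-off
era = toy["TeamLaunch"].lt(1940).map({True: "before 1940", False: "1940 or later"})

# Number of teams, total and average championships per era
summary = toy.groupby(era)["TournamentChampion"].agg(["count", "sum", "mean"])
print(summary)
```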

In [27]:
sns.scatterplot(data=basketball, x="TeamLaunch", y="Loss Percent",hue='IfWinner');

Winning teams have low loss percent.
New teams have very high loss percent.

In [28]:
#Teams with the most baskets scored
b = basketball.sort_values('BasketScored',ascending=False).head()
sns.barplot(x="Team", y = 'BasketScored',data=b); 

Team 1 has scored maximum baskets.
Team 2 has scored nearly as many baskets as Team 1, but slightly fewer.

In [29]:
#Recommendations - Which teams to approach for a winning deal
c=basketball.sort_values('TournamentChampion',ascending=False).head(10) #Top 10 tournament champions
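corr_mat_c=c.select_dtypes(include=['number','bool']).corr() #Correlation matrix of numeric and boolean columns (recent pandas versions raise an error on the text column Team)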
fig, ax = plt.subplots(figsize=(15,5)) #Set Figure Size
sns.heatmap(corr_mat_c,annot=True); #Plot heatmap of correlation matrix

Based on the above heat map, BasketDiff, Win Percent, Loss Percent, Runner-up and WonGames are highly correlated with TournamentChampion.
**NOTE** - Correlation does not indicate causation. These are not the only factors leading to championships. But they are good indicators as to which teams can succeed.
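Instead of reading the values off the heat map, the correlations with TournamentChampion can be ranked directly. The sketch below uses hypothetical numbers; note that with only 10 teams in the subset, any such correlation is a rough indication at best.

```python
import pandas as pd

# Hypothetical metrics for a handful of teams (illustrative values only)
toy = pd.DataFrame({
    "TournamentChampion": [8, 5, 3, 1, 0],
    "WonGames": [900, 700, 500, 300, 100],
    "Loss Percent": [20.0, 30.0, 35.0, 45.0, 60.0],
})

# Correlation of every other column with the number of championships
target_corr = toy.corr()["TournamentChampion"].drop("TournamentChampion")

# Order by absolute strength so strong negative correlations are not overlooked
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```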

In [30]:
#Minimum metrics to become Tournament Champion
c[['BasketDiff','Win Percent','Loss Percent','Runner-up','WonGames']].min() 

In [31]:
d = basketball.sort_values('TournamentChampion',ascending=False) #All teams sorted by number of championships
d = d[10:] #Teams other than Top 10 champions
# filter dataframe based on the minimum metrics of the Top 10 champions
d = d.loc[(d['BasketDiff']>=-333) & (d['Win Percent']>=32.5) & (d['Runner-up']>=0) & (d['WonGames']>=52)]
#The Loss Percent filter is left out because applying it leaves 0 teams to recommend
d.sort_values('TeamLaunch',ascending=False) # Sort by newest teams first
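The thresholds in the filter are typed in by hand from the previous output. They could instead be derived from the minimum metrics of the top champions, so the filter stays correct if the data changes. The following sketch runs on hypothetical teams with a top 2 instead of a top 10; the frame, the names top, rest, floor and candidates are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical teams (illustrative values only)
toy = pd.DataFrame({
    "Team": ["A", "B", "C", "D", "E"],
    "TournamentChampion": [5, 3, 0, 0, 0],
    "BasketDiff": [200, -50, 10, -300, -20],
    "Win Percent": [55.0, 40.0, 45.0, 20.0, 41.0],
    "WonGames": [900, 400, 450, 50, 380],
})

# Split into the top champions and everyone else
ranked = toy.sort_values("TournamentChampion", ascending=False)
top, rest = ranked.head(2), ranked.iloc[2:]

# The minimum of each metric among the top champions becomes the threshold
metrics = ["BasketDiff", "Win Percent", "WonGames"]
floor = top[metrics].min()

# Keep the remaining teams that meet or exceed every threshold
mask = rest[metrics].ge(floor, axis="columns").all(axis=1)
candidates = rest[mask]
print(candidates["Team"].tolist())
```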

Based on the above analysis, I'd recommend Company X to approach Team Number 21, 19, 9, 18, 29 and 7 to get a winning deal. 

In [32]:
#Teams that have been runner-up at least once
d = d.loc[(d['BasketDiff']>=-333) & (d['Win Percent']>=32.5) & (d['Runner-up']>0) & (d['WonGames']>=52)]
d.sort_values('Loss Percent',ascending=True) # Sort by lowest Loss Percent first

Team 21 is the most promising team based on the above analysis, for the following reasons:
1. Its TeamLaunch is 1998, which is relatively recent; most championship-winning teams were launched before 1940.
2. Some of the older teams are already in contract with competitors, but we don't know which ones. Hence, it is reasonable to assume that a newer team is not yet under contract with a competitor.
3. Team 21 has also been a runner-up once in the tournament, which boosts its future prospects.
4. Compared with the other team that passes this filter, which is older, Team 21 is a NetScorer and a NetWinner.