# 1. Loading Dataset


In [5]:
import pandas as pd
original_df = pd.read_csv("/content/drive/MyDrive/Data Science/nba.csv")
#print(original_df)

# 2. Data Preprocessing

In [6]:
# Removing any rows with blanks or nulls.
original_df.dropna(inplace=True)

# Removing duplicate rows
original_df.drop_duplicates(inplace=True)

# Shuffle the data and split it: 80% for training, 20% for testing
train_df = original_df.sample(frac=0.8, random_state=50) # Random 80% sample of rows (seeded for reproducibility)
test_df = original_df.drop(train_df.index) # Remaining 20% (rows not sampled) for testing
#print(train_df)
#print(test_df)
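
The split above relies on `sample` picking a random 80% of the rows and `drop` removing those same index labels. A minimal sketch of how this could be sanity-checked, using a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins, not the NBA data):

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned NBA DataFrame.
toy_df = pd.DataFrame({"Age": range(20, 30), "Salary": range(100, 1100, 100)})

train = toy_df.sample(frac=0.8, random_state=50)  # random 80% of the rows
test = toy_df.drop(train.index)                   # every row that was not sampled

n_train, n_test = len(train), len(test)
overlap = train.index.intersection(test.index)    # should be empty

print("train rows:", n_train)
print("test rows:", n_test)
print("shared rows:", len(overlap))
```

This check assumes the index labels are unique, which holds for a default index after `dropna` and `drop_duplicates`.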

# 3. Statistical Description

In [7]:
print("Statistical description for training dataset:")
print(train_df.describe())

print("\nStatistical description for testing dataset:")
print(test_df.describe())

Statistical description for training dataset:
           Number         Age      Weight        Salary
count  291.000000  291.000000  291.000000  2.910000e+02
mean    16.652921   26.718213  218.752577  4.707358e+06
std     14.653933    4.200531   24.527549  5.161403e+06
min      0.000000   19.000000  161.000000  5.572200e+04
25%      5.000000   24.000000  200.000000  1.011224e+06
50%     12.000000   26.000000  220.000000  2.525160e+06
75%     24.000000   29.000000  238.000000  6.315702e+06
max     99.000000   40.000000  275.000000  2.287500e+07

Statistical description for testing dataset:
          Number        Age      Weight        Salary
count  73.000000  73.000000   73.000000  7.300000e+01
mean   17.534247  26.205479  223.904110  4.273317e+06
std    16.369078   4.368317   25.582744  4.969561e+06
min     0.000000  20.000000  175.000000  5.572200e+04
25%     5.000000  23.000000  205.000000  9.472760e+05
50%    12.000000  25.000000  225.000000  2.500000e+06
75%    30.000000  28.00000

### Interpretation of Statistical Description

Count = Number of non-null values in each column.

Mean = Average value of each column.

Std (Standard Deviation) = Measures how spread out the values are around the mean.

Min = Smallest value in each column.

25% = First quartile (25th percentile).

50% = Median (50th percentile), the middle value.

75% = Third quartile (75th percentile).

Max = Largest value in each column.
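
To make these definitions concrete, here is a small sketch that compares `describe()` with the same statistics computed one by one. The `ages` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages, chosen only to illustrate the statistics.
ages = pd.Series([19, 24, 26, 29, 40])

summary = ages.describe()

# The same quantities computed individually.
manual = {
    "count": ages.count(),        # number of non-null values
    "mean": ages.mean(),          # average
    "std": ages.std(),            # sample standard deviation (ddof=1)
    "min": ages.min(),
    "25%": ages.quantile(0.25),   # first quartile
    "50%": ages.median(),         # median
    "75%": ages.quantile(0.75),   # third quartile
    "max": ages.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: describe={summary[name]:.4f}  manual={value:.4f}")
```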


# 4. Correlation Analysis

In [10]:
# selecting 3 teams to filter
dataframe_Boston = original_df.loc[original_df['Team'] == 'Boston Celtics']
dataframe_Brooks = original_df.loc[original_df['Team'] == 'Brooklyn Nets']
dataframe_NewYork = original_df.loc[original_df['Team'] == 'New York Knicks']
# print(dataframe_Boston)

# Merging 3 data frames of 3 teams into one dataframe
merged_dataframe = pd.concat([dataframe_Boston, dataframe_Brooks, dataframe_NewYork])
# print(merged_dataframe)

# Creating salary list of each team
boston_salaries = merged_dataframe[merged_dataframe['Team'] == "Boston Celtics"]["Salary"].reset_index(drop=True)
brooklyn_salaries = merged_dataframe[merged_dataframe['Team'] == "Brooklyn Nets"]["Salary"].reset_index(drop=True)
newyork_salaries = merged_dataframe[merged_dataframe['Team'] == "New York Knicks"]["Salary"].reset_index(drop=True)
# print(boston_salaries)

# Combining salaries of three teams
salary_dataframe = pd.DataFrame({
    "Boston Celtics": boston_salaries,
    "Brooklyn Nets": brooklyn_salaries,
    "New York Knicks": newyork_salaries
})

# Removing any rows with blanks or nulls
salary_dataframe.dropna(inplace=True)

# Removing duplicate rows
salary_dataframe.drop_duplicates(inplace=True)
# print(salary_dataframe)

# Computing the pairwise Pearson correlation between the salary columns
correlation = salary_dataframe.corr()

print("\nCorrelation Matrix for Salaries of Selected Teams:")
print(correlation)


Correlation Matrix for Salaries of Selected Teams:
                 Boston Celtics  Brooklyn Nets  New York Knicks
Boston Celtics         1.000000      -0.102263        -0.128966
Brooklyn Nets         -0.102263       1.000000         0.304434
New York Knicks       -0.128966       0.304434         1.000000
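
Because the three salary columns are aligned by row position, teams with different roster sizes produce missing values, and `dropna` trims the table to the shortest column. The sketch below illustrates that behaviour with two made-up salary lists (`team_a` and `team_b` are hypothetical, not real teams):

```python
import pandas as pd

# Hypothetical salary lists of different lengths.
team_a = pd.Series([10, 20, 30, 40])  # 4 players
team_b = pd.Series([5, 15, 25])       # 3 players

frame = pd.DataFrame({"Team A": team_a, "Team B": team_b})
n_before = len(frame)        # the shorter column is padded with NaN

trimmed = frame.dropna()     # drops the row where Team B has no value
n_after = len(trimmed)

# Pearson correlation between the two columns, paired row by row.
r = trimmed["Team A"].corr(trimmed["Team B"])

print(frame)
print("rows before/after dropna:", n_before, n_after)
print("correlation:", r)
```

Since players are paired purely by their row position, the resulting coefficient depends on the order in which players appear in each team's list.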


### Insights of Correlation Analysis

1 = Perfect positive correlation / 0 = No correlation / -1 = Perfect negative correlation. The closer a value is to 0, the weaker the linear relationship.

Based on the result of the first row,

Boston Celtics & Boston Celtics = 1 (Self-correlation)

Boston Celtics & Brooklyn Nets = -0.102263 (Weak negative correlation)

Boston Celtics & New York Knicks = -0.128966 (Weak negative correlation)

Based on the result of the second row,

Brooklyn Nets & Boston Celtics = -0.102263 (Weak negative correlation)

Brooklyn Nets & Brooklyn Nets = 1 (Self-correlation)

Brooklyn Nets & New York Knicks = 0.304434 (Weak positive correlation)

Based on the result of the third row,

New York Knicks & Boston Celtics = -0.128966 (Weak negative correlation)

New York Knicks & Brooklyn Nets = 0.304434 (Weak positive correlation)

New York Knicks & New York Knicks = 1 (Self-correlation)
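
The verbal labels can also be assigned programmatically. The sketch below uses a hypothetical helper, `label_correlation`, with cutoffs of 0.4 and 0.7 on the absolute value. These cutoffs are an assumed rule of thumb, not a fixed standard, so adjust them to whatever convention you follow.

```python
def label_correlation(r, weak_cutoff=0.4, strong_cutoff=0.7):
    """Return a rough verbal label for a Pearson coefficient r in [-1, 1].

    The cutoffs are illustrative assumptions, not a universal standard.
    """
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size < weak_cutoff:
        strength = "weak"
    elif size < strong_cutoff:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} {direction}"


# Off-diagonal values copied from the correlation matrix above.
pairs = {
    "Boston Celtics & Brooklyn Nets": -0.102263,
    "Boston Celtics & New York Knicks": -0.128966,
    "Brooklyn Nets & New York Knicks": 0.304434,
}

labels = {pair: label_correlation(r) for pair, r in pairs.items()}
for pair, label in labels.items():
    print(f"{pair}: {label}")
```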
