# Basics of Data Analysis in Python!
Welcome to the basics of data analysis in Python! This notebook covers the basic functionality of Python libraries such as pandas, NumPy, and Matplotlib, which you can use to manipulate and visualize data.

### A) IMPORT LIBRARIES

In [118]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### B) LOADING DATA
The first step is to read in the data that has been collected. To do this we will use pandas, a popular open-source data manipulation and analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets, along with tools for working with structured data. The primary data structure in pandas is the `DataFrame`, a two-dimensional table with labeled rows and columns.
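To make the idea of a `DataFrame` concrete, here is a minimal sketch that builds one from a plain dictionary. The column names and values are hypothetical toy data, not taken from the CSV file used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column
toy_df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.0, 3.5],
})

print(type(toy_df))             # the pandas DataFrame class
print(toy_df.shape)             # (number of rows, number of columns)
print(toy_df["score"].mean())   # column-wise operations work on a single column
```

Each column can hold a different type, which is why the same structure can store both player names and numeric statistics.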

##### Creating a Pandas Dataframe
Reading in a CSV file and an Excel spreadsheet

In [119]:
# Read in the CSV file stored in ./Data Samples
csv_input = './Data Samples/nba_players.csv'
csv_df = pd.read_csv(csv_input)

# Display the new DataFrame
display(csv_df)

# Read in the .xlsx file stored in ./Data Samples
# xlsx_input = './Data Samples/sample.xlsx'
# xlsx_df = pd.read_excel(xlsx_input)



Unnamed: 0,NBA Player,Salary,Points Per Game,Games Played
0,LeBron James,39000000,25.4,55
1,Kevin Durant,38000000,27.0,60
2,Stephen Curry,43000000,30.5,58
3,Giannis Antetokounmpo,39270000,28.1,57
4,Kawhi Leonard,34400000,26.9,53
5,Luka Dončić,8550000,28.8,59
6,Anthony Davis,37000000,23.0,56
7,James Harden,42000000,25.0,54
8,Joel Embiid,31870000,28.5,52
9,Jayson Tatum,28000000,24.8,57
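As a self-contained sketch of what `pd.read_csv` does, the snippet below parses two rows copied from the table above out of an in-memory string instead of a file. The names `csv_text` and `sample_df` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

# Two rows copied from the table above, held in a string instead of a file
csv_text = """NBA Player,Salary,Points Per Game,Games Played
LeBron James,39000000,25.4,55
Kevin Durant,38000000,27.0,60
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# pandas infers a type for each column from the text it reads
print(sample_df.dtypes)
```

Whole-number columns such as `Salary` are inferred as integers, while `Points Per Game` is inferred as floating point.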


#### Exploring the Data

In [120]:
n = 5

# Remember that Python indexing starts at 0, not 1
# First n rows of the DataFrame
print(f"Head: {csv_df.head(n)}\n")

# Last n rows of the DataFrame
print(f"Tail: {csv_df.tail(n)}\n")

# Number of rows and columns in the DataFrame
print(f"Shape: {csv_df.shape}\n")

# Summary of the DataFrame
print("Info:")
# info() prints its summary directly and returns None, so it is not wrapped in print()
csv_df.info()

# Descriptive statistics of the DataFrame
print(f"Describe: {csv_df.describe()}\n")

# List of column names
print(f"Columns: {csv_df.columns}\n")

# Sum, mean, standard deviation, minimum, and maximum of each numeric column
# numeric = csv_df.select_dtypes(include='number')
# print(f"Sum: {numeric.sum()}\nMean: {numeric.mean()}\nStD: {numeric.std()}\nMin: {numeric.min()}\nMax: {numeric.max()}")

Head:               NBA Player    Salary  Points Per Game  Games Played
0           LeBron James  39000000             25.4            55
1           Kevin Durant  38000000             27.0            60
2          Stephen Curry  43000000             30.5            58
3  Giannis Antetokounmpo  39270000             28.1            57
4          Kawhi Leonard  34400000             26.9            53

Tail:          NBA Player    Salary  Points Per Game  Games Played
25      Rudy Gobert  41800000             14.3            55
26      CJ McCollum  29730000             23.1            50
27  Khris Middleton  33480000             20.4            55
28     Jamal Murray  29230000             21.2            52
29   Brandon Ingram  27093019             23.8            56

Shape: (30, 4)

Info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   NB

#### Sorting the Data

In [121]:
# Sorting data based on player salary
# Function defaults to ascending order, ascending=False changes to descending order
sorted_df = csv_df.sort_values(by='Salary', ascending=False)

n = 5
# Display the n highest paid players using .head
print(f"{n} highest paid players in the DataFrame:")
display(sorted_df.head(n))

# Display the n lowest paid players using .tail
print(f"{n} lowest paid players in the DataFrame:")
display(sorted_df.tail(n))


5 highest paid players in the DataFrame:


Unnamed: 0,NBA Player,Salary,Points Per Game,Games Played
12,Damian Lillard,43800000,28.8,56
2,Stephen Curry,43000000,30.5,58
7,James Harden,42000000,25.0,54
25,Rudy Gobert,41800000,14.3,55
18,Chris Paul,41320000,16.4,58


5 lowest paid players in the DataFrame:


Unnamed: 0,NBA Player,Salary,Points Per Game,Games Played
23,Deandre Ayton,10250000,14.4,56
16,Zion Williamson,10250000,27.0,50
21,Trae Young,8700000,28.9,56
5,Luka Dončić,8550000,28.8,59
17,De'Aaron Fox,8500000,24.5,55
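When only the top or bottom few rows are needed, pandas also offers `nlargest` and `nsmallest` as an alternative to sorting the whole DataFrame. This is a swapped-in technique, not what the cell above uses. The sketch below applies it to four rows copied from the data; the variable names are illustrative.

```python
import pandas as pd

# Four rows copied from the dataset above
toy = pd.DataFrame({
    "NBA Player": ["Stephen Curry", "Luka Dončić", "Kevin Durant", "Jayson Tatum"],
    "Salary": [43000000, 8550000, 38000000, 28000000],
})

# Rows with the two largest salaries, highest first
top2 = toy.nlargest(2, "Salary")

# Rows with the two smallest salaries, lowest first
bottom2 = toy.nsmallest(2, "Salary")

print(top2)
print(bottom2)
```

Note that `nsmallest` lists the lowest value first, whereas `sorted_df.tail(n)` on a descending sort lists it last.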


#### Cleaning Bad Data Entries
In some cases, a dataset contains entries with null values or outliers that we want to filter out. In this example, we add four entries: two with null values and two with outlying salaries. *Note: there may be a more efficient way to add these rows to a DataFrame.*

In [122]:
# First, we add the bad data entries into a list
new_entries = [
    {'NBA Player': 'Tyler Trimble', 'Salary': 700000000, 'Points Per Game': 20, 'Games Played': 21},
    {'NBA Player': 'Kaela Nel', 'Salary': 5928839, 'Points Per Game': 28.2, 'Games Played': 44},
    {'NBA Player': 'Clairiz Nel', 'Salary': 394993, 'Points Per Game': 19.1, 'Games Played': None},
    {'NBA Player': 'Connor Nel', 'Salary': 848929, 'Points Per Game': None, 'Games Played': 75},
]

# Create a temporary dataframe that holds the new entries
temp_df = pd.DataFrame(new_entries)

# Concatenate the temporary DataFrame onto the end of the existing one
# ignore_index=True discards the original row labels and renumbers the result from 0
df = pd.concat([csv_df, temp_df], ignore_index=True)
display(df.tail(10))

Unnamed: 0,NBA Player,Salary,Points Per Game,Games Played
24,Jrue Holiday,26130000,17.7,54.0
25,Rudy Gobert,41800000,14.3,55.0
26,CJ McCollum,29730000,23.1,50.0
27,Khris Middleton,33480000,20.4,55.0
28,Jamal Murray,29230000,21.2,52.0
29,Brandon Ingram,27093019,23.8,56.0
30,Tyler Trimble,700000000,20.0,21.0
31,Kaela Nel,5928839,28.2,44.0
32,Clairiz Nel,394993,19.1,
33,Connor Nel,848929,,75.0
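Two details of the output above are worth a closer look: `Games Played` now displays as `54.0` rather than `54`, and the new rows continue the numbering at 30. The minimal sketch below reproduces both effects on a hypothetical two-row frame; `base`, `extra`, `combined`, and `stacked` are illustrative names.

```python
import pandas as pd

# An integer column, like 'Games Played' in the original data
base = pd.DataFrame({"Games Played": [55, 60]})

# New entries, one of which is missing a value
extra = pd.DataFrame([{"Games Played": 21}, {"Games Played": None}])

# With ignore_index=True the result is renumbered 0..3
combined = pd.concat([base, extra], ignore_index=True)

# Without it, each frame keeps its own labels, so labels repeat
stacked = pd.concat([base, extra])

print(base["Games Played"].dtype)       # integer before concatenation
print(combined["Games Played"].dtype)   # floating point afterwards
print(list(combined.index), list(stacked.index))
```

The missing value is stored as `NaN`, which is a floating-point value, so the whole column is converted to floats.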


Now that the four bad entries have been added, we first remove all rows that contain null values. We then flag outliers in a chosen column using an interquartile range (IQR) check.

In [123]:
# Display rows with null values
print("Entries with null values:")
display(df[df.isnull().any(axis=1)])

# Delete any rows with null values
df = df.dropna()

# Function to filter outliers based on interquartile range (IQR)
def filter_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    print(f"Lower bound: {lower_bound}")
    print(f"Upper bound: {upper_bound}")
    print(f"Outliers:\n {df[(df[column] < lower_bound) | (df[column] > upper_bound)]}")
    
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Filter outliers in the 'Salary' column
df_filtered = filter_outliers(df, 'Salary')
# display(df_filtered)

Entries with null values:


Unnamed: 0,NBA Player,Salary,Points Per Game,Games Played
32,Clairiz Nel,394993,19.1,
33,Connor Nel,848929,,75.0


Lower bound: 8529410.625
Upper bound: 57390353.625
Outliers:
        NBA Player     Salary  Points Per Game  Games Played
17   De'Aaron Fox    8500000             24.5          55.0
30  Tyler Trimble  700000000             20.0          21.0
31      Kaela Nel    5928839             28.2          44.0
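To see how the bounds are derived, here is the same IQR rule worked through on a small hypothetical series. The values and variable names are made up for illustration, and the quantiles rely on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 14, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.5
q3 = values.quantile(0.75)   # 17.5
iqr = q3 - q1                # 5.0

lower_bound = q1 - 1.5 * iqr   # 5.0
upper_bound = q3 + 1.5 * iqr   # 25.0

# Keep only the values inside the bounds (inclusive)
kept = values[(values >= lower_bound) & (values <= upper_bound)]

print(lower_bound, upper_bound)
print(kept.tolist())   # [10, 12, 14, 16, 18]
```

The rule is purely statistical, so it can also flag legitimate entries: in the output above, De'Aaron Fox's salary of 8,500,000 falls just under the lower bound of 8,529,410.625 and is reported as an outlier alongside the two entries that were added.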


### C) VISUALIZING DATA

### D) SAVING DATA