### Hello, Team CMSHS!

## Introduction to data analytics for beginners

In this notebook, you will learn how to use Python and data analytics tools to explore the NBA game scores dataset. The goal is to build your skills step-by-step, from understanding what data is, to analyzing it with Python and Pandas, and finally visualizing and summarizing insights!

Make sure that you have downloaded the NBA dataset (nba_data.csv) and placed it, together with this Jupyter notebook, in your Jupyter lab area.

### Part 0: What is a Library? Installing Pandas & NumPy

Before we dive into data analysis, let's understand what a **library** is in programming.

---

### What is a Library?

- A **library** is a collection of pre-written code that you can use to make programming easier.
- Instead of writing everything from scratch, you can **import** libraries that help you with specific tasks.
- For example:
  - **Pandas** helps with working with tables of data.
  - **NumPy** helps with numerical calculations.
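To make the idea concrete, here is a small sketch that computes an average twice: once by hand and once with NumPy. The scores are the home scores of the first five games shown later in this notebook; the variable names are just for illustration.

```python
import numpy as np

# Home scores from the first five games shown later in this notebook
home_scores = [88, 106, 112, 95, 101]

# Without a library: add everything up, then divide by how many there are
total = 0
for score in home_scores:
    total = total + score
manual_average = total / len(home_scores)

# With NumPy: one function call does the same job
numpy_average = np.mean(home_scores)

print("Average by hand:", manual_average)
print("Average with NumPy:", numpy_average)
```

Both lines should print the same number. The library version is shorter and harder to get wrong.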

---

### How to Install Libraries (Pandas, NumPy)

If you don’t have these libraries installed yet, here are **two ways** to install them:

#### 1. Using Command Prompt / Terminal

Open your command line interface and run:

```bash
pip install pandas numpy
```

This installs Pandas and NumPy into the Python environment that your `pip` command belongs to.

---

#### 2. Using Jupyter Notebook directly

In a Jupyter notebook cell, type this and run it:

```python
!pip install pandas numpy
```

The exclamation mark `!` lets you run command-line instructions from inside the notebook.
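One thing to watch for: `!pip` uses whichever `pip` your computer finds first, and that may not belong to the Python that runs this notebook. If the import still fails after installing, a quick check like the sketch below can help. It only prints information and installs nothing.

```python
import sys

# The Python program that is running this notebook
print("This notebook is running Python from:", sys.executable)

# An install command tied to that exact Python (shown here, not executed)
install_command = [sys.executable, "-m", "pip", "install", "pandas", "numpy"]
print("Matching install command:", " ".join(install_command))
```

Inside a notebook cell you can also use `%pip install pandas numpy`, which installs into the environment the notebook itself is using.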

---

After installation, you can import these libraries in your code like this:

```python
import pandas as pd
import numpy as np
```

---

Let's test if you have Pandas and NumPy installed by importing them now!


```python
# Test importing pandas and numpy
try:
    import pandas as pd
    import numpy as np
    print("Pandas version:", pd.__version__)
    print("NumPy version:", np.__version__)
    print("Libraries are installed and imported successfully!")
except ImportError as e:
    print("One or more libraries are not installed yet.")
    print("Please install them using the instructions above and restart the notebook.")
```

## Day 1: Foundations

### 1. What is Data? Why NBA Data?
Data is information collected in a structured format. In sports, data can tell stories about wins, scores, player performance, and more.

Today, you get to dive into real NBA data: scores from games, teams playing, and the results. Let's start by loading the data!

```python
# 1: Importing libraries and loading the data
import pandas as pd

# Load nba_data.csv into a DataFrame called nba_df
nba_df = pd.read_csv('nba_data.csv')

# Show first few rows (pro-tip: you can customize head by adding a number inside it! try .head(2) -> this returns the first 2 rows!)
nba_df.head()
```


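| Unnamed: 0 | GAME_ID  | GAME_DATE  | HOME_TEAM | AWAY_TEAM | HOME_SCORE | AWAY_SCORE | DIFFERENCE |
|------------|----------|------------|-----------|-----------|------------|------------|------------|
| 0          | 21000001 | 2010-10-26 | MIA       | BOS       | 88         | 80         | 8          |
| 1          | 21000002 | 2010-10-26 | PHX       | POR       | 106        | 92         | 14         |
| 2          | 21000003 | 2010-10-26 | HOU       | LAL       | 112        | 110        | 2          |
| 3          | 21000004 | 2010-10-27 | BOS       | CLE       | 95         | 87         | 8          |
| 4          | 21000005 | 2010-10-27 | NJN       | DET       | 101        | 98         | 3          |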


#### Quick description of the dataset


| Column Name   | What It Means                                                       |
|---------------|--------------------------------------------------------------------|
| **index** (shown as `Unnamed: 0`) | A number that shows the row’s position in the table (like a line number). It just helps us find or reference each row. Pandas uses the name `Unnamed: 0` when a column has no name in the file. |
| **GAME_ID**   | A unique number that identifies each NBA game. |
| **GAME_DATE** | The date when the game was played (year-month-day).               |
| **HOME_TEAM** | The abbreviation (short name) of the team playing *at home* for that game. “MIA” means Miami Heat, “BOS” means Boston Celtics, etc. |
| **AWAY_TEAM** | The abbreviation of the team playing *away* (the visiting team).   |
| **HOME_SCORE**| The total points scored by the home team in that game.             |
| **AWAY_SCORE**| The total points scored by the away team in that game.             |
| **DIFFERENCE**| The difference in points between the two teams (home score minus away score). It shows how close or one-sided the game was. |
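Curious where `Unnamed: 0` comes from? The sketch below rebuilds a two-row stand-in for the file in memory. It assumes (this is a guess about `nba_data.csv`, not something confirmed here) that each line of the real file starts with a row number under an empty header.

```python
import io
import pandas as pd

# A tiny stand-in for nba_data.csv, using two rows shown in this notebook.
# Assumption: the real file starts each line with an unlabeled row number.
csv_text = """,GAME_ID,GAME_DATE,HOME_TEAM,AWAY_TEAM,HOME_SCORE,AWAY_SCORE,DIFFERENCE
0,21000001,2010-10-26,MIA,BOS,88,80,8
1,21000002,2010-10-26,PHX,POR,106,92,14
"""

# Default loading: the unlabeled column gets the name "Unnamed: 0"
raw_df = pd.read_csv(io.StringIO(csv_text))
print(list(raw_df.columns))

# index_col=0 tells Pandas to use that first column as the row index instead
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(clean_df.columns))
```

If your file behaves the same way, `pd.read_csv('nba_data.csv', index_col=0)` would load it without the extra column. The rest of this notebook works either way.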

---

### Example:

| GAME_ID | GAME_DATE | HOME_TEAM | AWAY_TEAM | HOME_SCORE | AWAY_SCORE | DIFFERENCE |
|---------|------------|-----------|-----------|------------|------------|------------|
| 21000001| 2010-10-26 | MIA       | BOS       | 88         | 80         | 8          |

This means on October 26, 2010, the Miami Heat (home team) played against the Boston Celtics (away team). Miami scored 88 points, Boston scored 80 points, so Miami won by 8 points.

### 2. Python Basics: Variables, Data Types, and Loops

In Python, we use variables to store values. Here's a quick introduction:

- Numbers: `points = 88` -> numbers can be integers (`int`, e.g., 1, 2, 3) or floats (`float`, e.g., 1.1, 0.001, 0.5, -9.99)
- Strings: `team = "MIA"`
- Lists: `scores = [88, 80, 106]`
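If you are ever unsure what kind of value a variable holds, Python's built-in `type()` will tell you. A minimal sketch (the variable `average` is added here just to show a float):

```python
points = 88              # a whole number
average = 100.4          # a number with a decimal point
team = "MIA"             # text
scores = [88, 80, 106]   # several values in one variable

print(type(points))
print(type(average))
print(type(team))
print(type(scores))

# Lists keep their items in order; counting starts at 0
print("First score:", scores[0])
print("How many scores:", len(scores))
```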

Let's try some simple Python commands!

```python
# Variables and print example
home_score = 88
away_score = 80
print("The home team scored", home_score, "points.")
print("The away team scored", away_score, "points.")

# Calculate difference
difference = home_score - away_score
print("The score difference was:", difference)
```

Output:

```
The home team scored 88 points.
The away team scored 80 points.
The score difference was: 8
```
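Later in this notebook you will see text that starts with `f"..."`. That is an f-string: it lets you drop variables straight into a sentence using curly braces. A short sketch using the same game:

```python
home_team = "MIA"
away_team = "BOS"
home_score = 88
away_score = 80

difference = home_score - away_score

# Anything inside { } is replaced by the value of that variable
message = f"{home_team} beat {away_team} by {difference} points."
print(message)
```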


### Mini Quiz 1: Python Basics

**Question:** What will be the output of this code?

```python
score = 100
if score > 90:
    print("Great game!")
else:
    print("Keep trying!")
```

a) Great game!  
b) Keep trying!  
c) Error  
d) Nothing

(Type your answer below)

### 3. Conditions and Loops

Loops help us repeat actions. Here’s how to find all games with home score > 100 using a loop:

```python
# Example list of home scores
home_scores = [88, 106, 112, 95, 101]

print("Games where home team scored more than 100 points:")
for score in home_scores:
    if score > 100:
        print(score)
```

Try looping over real data next!




```python
# Loop through home scores in nba_df and print the game IDs where home score > 100
print("Game IDs where home team scored more than 100 points:")
for idx, row in nba_df.iterrows():
    if row['HOME_SCORE'] > 100:
        print(row['GAME_ID'])
```

Ok, wait, what is iterrows?

**iterrows() lets you *go through each row* in a table (Pandas DataFrame) one at a time, so you can look at or work with the data for that row.**

- Think of your DataFrame like a giant spreadsheet or a list of game records.
- Sometimes you want to check each game one by one.
- `iterrows()` gives you a way to move through every row, like reading the spreadsheet row by row.

---

**Analogy:**

Imagine your class attendance sheet:

| Student | Present? |
|---------|----------|
| Alice   | Yes      |
| Bob     | No       |
| Carla   | Yes      |

If you want to call out each student’s name and say if they are present, you’d go down the list row by row.

`iterrows()` is like doing that in code: going through each row with data you can use.
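Here is the attendance sheet from the analogy written as a small DataFrame, as a sketch you can run. The names `attendance` and `roll_call` are made up for this example.

```python
import pandas as pd

# The attendance sheet from the analogy above
attendance = pd.DataFrame({
    "Student": ["Alice", "Bob", "Carla"],
    "Present?": ["Yes", "No", "Yes"],
})

roll_call = []
for idx, row in attendance.iterrows():
    # idx is the row number; row holds the values for that one student
    line = f"{row['Student']}: {row['Present?']}"
    roll_call.append(line)
    print(idx, line)
```

`iterrows()` is easy to read, but it gets slow on very large tables. Filtering, covered in section 5.2, is usually the faster choice when you only need to pick out rows.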


### 4. Introduction to Pandas and DataFrames

A **DataFrame** is a table of data (rows and columns). It's similar to a spreadsheet.

We can use the Pandas library to **load**, **view**, and **analyze** data easily.

We already loaded `nba_df`. Let's explore important commands:

- `nba_df.head()` shows the first 5 rows
- `nba_df.info()` shows the column names, their data types, and how many values each column has
- `nba_df.shape` shows the number of rows (first entry) and columns (second entry)
- `nba_df.describe()` shows basic statistics

Try running these now!


```python
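# DataFrame exploration commands
print("Data info:")
nba_df.info()  # info() prints its report by itself

print("\nStatistical summary:")
# print() is needed here: a cell only shows its last line automatically
print(nba_df.describe())

print("\nShape (rows, columns):")
print(nba_df.shape)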
```
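To see what these commands give back, here is a sketch on a five-row table built from the games shown at the top of this notebook. The name `games` is made up for this example so that it does not overwrite `nba_df`.

```python
import pandas as pd

# The first five games from this notebook, typed in by hand
games = pd.DataFrame({
    "HOME_TEAM": ["MIA", "PHX", "HOU", "BOS", "NJN"],
    "AWAY_TEAM": ["BOS", "POR", "LAL", "CLE", "DET"],
    "HOME_SCORE": [88, 106, 112, 95, 101],
    "AWAY_SCORE": [80, 92, 110, 87, 98],
})

# shape holds two numbers, so we can unpack it into two variables
n_rows, n_cols = games.shape
print("Rows:", n_rows, "Columns:", n_cols)

# describe() returns a table too, so we can pick single values out of it
summary = games.describe()
print("Average home score:", summary.loc["mean", "HOME_SCORE"])
print("Highest away score:", summary.loc["max", "AWAY_SCORE"])
```

Note that `shape` has no parentheses: it is a stored value, not a function you call.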

### 5. Exploring the Data: Columns, Filtering, and Adding New Information

We have a big table of NBA game results. Now, let's learn how to:

- **See just one part of the data** (like just the home team scores)  
- **Find specific games** you’re interested in (for example, where the home team won)  
- **Create a new column** that tells us extra information, like which team won each game  

---

#### 5.1 Looking at a Single Column

Think of a DataFrame like a table or spreadsheet with many columns. If you want to see only *one* column, like the "HOME_SCORE" (points scored by the home team), you can do this:

```python
# Look at just the 'HOME_SCORE' column
home_scores = nba_df['HOME_SCORE']
print("Here are the first 5 home team scores:")
print(home_scores.head())
```

*Try running it; you’ll see only the points the home teams scored in the first few games.*
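A single column is called a **Series**, and it comes with handy shortcuts such as `.max()` and `.mean()`. You can also pick several columns at once by putting a list of names inside the brackets. The sketch below uses a small hand-typed table (named `games` for this example) with the first five games from this notebook.

```python
import pandas as pd

games = pd.DataFrame({
    "HOME_TEAM": ["MIA", "PHX", "HOU", "BOS", "NJN"],
    "HOME_SCORE": [88, 106, 112, 95, 101],
    "AWAY_SCORE": [80, 92, 110, 87, 98],
})

# One column name -> a Series
home_scores = games["HOME_SCORE"]
print(type(home_scores).__name__)
print("Highest:", home_scores.max())
print("Lowest:", home_scores.min())
print("Average:", home_scores.mean())

# A list of column names (note the double brackets) -> a smaller DataFrame
two_columns = games[["HOME_TEAM", "HOME_SCORE"]]
print(two_columns)
```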

---

#### 5.2 Filtering Rows (Selecting Specific Games)

What if you want to see *only the games where the home team won?*

A team wins if its score is higher than the other team’s. To find those games, we filter the DataFrame for rows where `HOME_SCORE` is greater than `AWAY_SCORE`:

```python
# Find games where the home team won
home_wins = nba_df[nba_df['HOME_SCORE'] > nba_df['AWAY_SCORE']]

# Count how many home wins
print(f"Number of home team wins so far: {home_wins.shape[0]}")
```

This line says: “Give me all the rows (games) where the home team scored *more* points than the away team.” Run it to see how many such games there are.
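It can help to look at the filter in two steps. The comparison first produces a column of `True`/`False` values, and the DataFrame then keeps only the `True` rows. In the sketch below, the first five rows come from this notebook, and the last row (teams `AAA` and `BBB`) is made up so that there is one away win to filter out.

```python
import pandas as pd

games = pd.DataFrame({
    "HOME_TEAM": ["MIA", "PHX", "HOU", "BOS", "NJN", "AAA"],  # "AAA" is a made-up team
    "AWAY_TEAM": ["BOS", "POR", "LAL", "CLE", "DET", "BBB"],  # "BBB" is a made-up team
    "HOME_SCORE": [88, 106, 112, 95, 101, 90],
    "AWAY_SCORE": [80, 92, 110, 87, 98, 97],
})

# Step 1: compare the two columns, row by row
home_won = games["HOME_SCORE"] > games["AWAY_SCORE"]
print(home_won.tolist())

# Step 2: keep only the rows marked True
home_wins = games[home_won]
print(home_wins)

# Two ways to count the matches
print("Rows kept:", home_wins.shape[0])
print("True values:", home_won.sum())
```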

---

#### 5.3 Adding a New Column: Who Won Each Game?

Right now, the table doesn’t tell us directly who the winner is. Let’s create a new column called `"WINNER"` that shows which team won each game:

```python
# Function to decide the winner of a game
def get_winner(row):
    if row['HOME_SCORE'] > row['AWAY_SCORE']:
        return row['HOME_TEAM']      # Home team wins
    elif row['AWAY_SCORE'] > row['HOME_SCORE']:
        return row['AWAY_TEAM']      # Away team wins
    else:
        return "Tie"                 # If both teams scored the same points

# Apply this function to every row and create a new 'WINNER' column
nba_df['WINNER'] = nba_df.apply(get_winner, axis=1)

# Look at the updated data with the new column
nba_df[['GAME_ID', 'HOME_TEAM', 'AWAY_TEAM', 'HOME_SCORE', 'AWAY_SCORE', 'WINNER']].head()
```

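The function `get_winner` looks at one game (row) and decides which team scored more points. `.apply(get_winner, axis=1)` runs that function on every row (`axis=1` means "go row by row"), and assigning the result to `nba_df['WINNER']` stores it as a new column in the DataFrame.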
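For the curious, here is a different way to build the same column without writing a function, using the Pandas methods `.where()` and `.mask()`. This is an alternative sketch, not a replacement for the approach above. The first two rows come from this notebook; the last two (teams `AAA` to `DDD`) are made up to show an away win and a tie.

```python
import pandas as pd

games = pd.DataFrame({
    "HOME_TEAM": ["MIA", "PHX", "AAA", "CCC"],  # "AAA" and "CCC" are made-up teams
    "AWAY_TEAM": ["BOS", "POR", "BBB", "DDD"],  # "BBB" and "DDD" are made-up teams
    "HOME_SCORE": [88, 106, 90, 100],
    "AWAY_SCORE": [80, 92, 97, 100],
})

home_won = games["HOME_SCORE"] > games["AWAY_SCORE"]
tied = games["HOME_SCORE"] == games["AWAY_SCORE"]

# Keep the home team where it won; otherwise use the away team
winner = games["HOME_TEAM"].where(home_won, games["AWAY_TEAM"])

# Overwrite with "Tie" wherever the scores were equal
winner = winner.mask(tied, "Tie")

games["WINNER"] = winner
print(games)
```

On big tables this style tends to run faster than `.apply()`, but the function version is often easier to read while you are learning.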

---

#### 5.4 Filtering in Action
Let's write code to filter for games where the point difference was 10 points or more. Display the first 3 such games with columns: `GAME_ID`, `HOME_TEAM`, `AWAY_TEAM`, `DIFFERENCE`. Finally, find the number of games satisfying this condition.

To do it:
```python
big_diff_games = nba_df[nba_df['DIFFERENCE'] >= 10]  # point difference was 10 pts or more

# Display the first 3 rows (print is needed because this is not the last line of the cell)
print(big_diff_games[['GAME_ID', 'HOME_TEAM', 'AWAY_TEAM', 'DIFFERENCE']].head(3))

big_diff_games.shape[0]  # don't fret! this just returns the number of games! (shape has no parentheses)
```
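One caveat worth checking in your own data: if `DIFFERENCE` is stored as home score minus away score, as the column table describes, then games won by the away team would have a *negative* difference and a plain `>= 10` filter would miss them. The sketch below shows how `.abs()` (absolute value) handles that. The first five values come from this notebook; the last row (`GAME_ID` 0, difference -15) is made up to represent an away blowout.

```python
import pandas as pd

games = pd.DataFrame({
    "GAME_ID": [21000001, 21000002, 21000003, 21000004, 21000005, 0],  # 0 is a made-up game
    "DIFFERENCE": [8, 14, 2, 8, 3, -15],                               # -15 is a made-up value
})

# Only catches big wins by the home team
home_blowouts = games[games["DIFFERENCE"] >= 10]
print("Without abs():", home_blowouts.shape[0])

# Catches big wins by either team
any_blowouts = games[games["DIFFERENCE"].abs() >= 10]
print("With abs():", any_blowouts.shape[0])
```

A quick `nba_df['DIFFERENCE'].min()` will tell you whether your dataset has any negative values at all.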


### Another quiz!

Can you tell me the number of games that had a point difference of less than 3? And give me the first 5 games satisfying the condition (use .head)?

That's it for day 1!

For your assignment today and on Friday, kindly scan through this notebook, add this to your Jupyter lab environment, and answer the mini quizzes when you can. Make sure that the dataset (nba_data.csv) is in the SAME directory/location as the Jupyter notebook (.ipynb).

Questions? Kindly send them over to me (jmaulino@its.jnj.com), Ms. Elisha (EDacana2@its.jnj.com), or Sir Francis, and we'll get back to you as soon as we can! I'll see you on Monday next week, November 24!