# Data Analysis in Python with NumPy and Pandas

Before we can use libraries, we need to import them.

We use the abbreviations `np` and `pd` to make it easier to refer to these packages.

In [2]:
#@title Imports

import numpy as np
import pandas as pd

## The `numpy` array

Recall how to create a Python list:

In [3]:
list1 = [1, 2, 3, 4]

Use the `np.array` function to convert a Python list to an `ndarray`:

In [4]:
array1 = np.array(list1)

print(array1)
print(type(array1))

[1 2 3 4]
<class 'numpy.ndarray'>


We start with a Python list (of lists) to create a 2D ndarray.

In [19]:
# TODO: Create a Python list called list2 that stores two lists [1, 2, 3] and [4, 5, 6]
list2 = ...

In [20]:
array2 = np.array(list2)

print(array2)
print(array2.shape)

Ellipsis
()


In [24]:
#@title Test cases (DO NOT EDIT)

assert np.array_equal(array2, np.array([[1, 2, 3], [4, 5, 6]])), f"Expected array2 to be [[1, 2, 3], [4, 5, 6]]. Was {array2}"

AssertionError: Expected array2 to be [[1, 2, 3], [4, 5, 6]]. Was Ellipsis

In [36]:
# You can also create arrays like this:

array3 = np.array([[1, 2, 3], [4, 5, 6]])
print(array3)
print(array3.shape)
print(np.array_equal(array2, array3))

print("--------------------------------------------------")

array4 = np.array([
    [
        [1, 2, 3, 4],
        [4, 5, 6, 7],
        [8, 9, 10, 11]
    ],
    [
        [7, 8, 9, 10],
        [10, 11, 12, 13],
        [14, 15, 16, 17]
    ]
])
print(array4)
print("array4 is a 3D array")
print(array4.shape)

[[1 2 3]
 [4 5 6]]
(2, 3)
False
--------------------------------------------------
[[[ 1  2  3  4]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[ 7  8  9 10]
  [10 11 12 13]
  [14 15 16 17]]]
array4 is a 3D array
(2, 3, 4)


When you have a NumPy array, you can use `.shape` to find out its dimensions:

- `.shape` returns a **tuple** with one number per dimension of the array
- For 2D arrays, the format is always `(rows, columns)`

You can think of `.shape` as asking:

> "How big is this array?"

This is called **dot notation** because we access the attribute using a dot: `array.shape`.
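As a quick illustration of the points above, the sketch below prints `.shape` for arrays with one, two, and three dimensions. The variable names (`row_of_four`, `grid`, `cube`) are made up for this example; `.ndim` is the NumPy attribute that holds the number of dimensions.

```python
import numpy as np

# Illustrative arrays (names are made up for this example)
row_of_four = np.array([10, 20, 30, 40])     # 1D: a single axis
grid = np.array([[1, 2], [3, 4], [5, 6]])    # 2D: 3 rows, 2 columns
cube = np.zeros((2, 3, 4))                   # 3D: same shape as array4 above

for name, arr in [("row_of_four", row_of_four), ("grid", grid), ("cube", cube)]:
    # .ndim is the number of dimensions; it equals len(arr.shape)
    print(name, "shape:", arr.shape, "ndim:", arr.ndim)
```

Note that the shape of a 1D array is a one-element tuple such as `(4,)`, not `(1, 4)`.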

Complete the cell below so that each array has the shape described in its comment.

In [25]:
# TODO: Create an array with 2 rows and 2 columns
a = np.array(...)

# TODO: Create an array with 1 row and 3 columns
b = np.array(...)

# TODO: Create an array with 3 rows and 1 column
c = np.array(...)

In [26]:
#@title Test cases (DO NOT EDIT)
assert a.shape == (2, 2), f"Expected a to be shape (2, 2). Was {a.shape}"
assert b.shape == (1, 3), f"Expected b to be shape (1, 3). Was {b.shape}"
assert c.shape == (3, 1), f"Expected c to be shape (3, 1). Was {c.shape}"

AssertionError: Expected a to be shape (2, 2). Was ()

### Timing comparisons: Python lists vs. ndarrays

We use ndarrays because calculations on them are usually much faster than the same calculations on Python lists: NumPy runs the loop over the elements in compiled code instead of in Python.
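Arithmetic on an ndarray is also written differently: an operator applies to every element at once, while the same operator means something else for a list. A small sketch (the names `small_list` and `small_array` are made up for this example):

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array(small_list)

# For a list, * repeats the list; it does not multiply the elements
print(small_list * 2)    # [1, 2, 3, 1, 2, 3]

# For an ndarray, * multiplies each element (a "vectorized" operation)
print(small_array * 2)   # [2 4 6]

# The list equivalent needs an explicit loop or comprehension
print([x * 2 for x in small_list])   # [2, 4, 6]
```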

Let's explore how the times differ for the same calculation on the same data for Python lists vs. ndarrays!
Run the code below multiple times :)

In [40]:
import time

# Create a large dataset
size = 10_000_000
py_list = list(range(size))
np_array = np.arange(size)

# Timing: Multiply every element by 2

# Python list (using list comprehension)
start = time.perf_counter()  # perf_counter is the clock intended for measuring short durations
py_result = [x * 2 for x in py_list]  # Equivalent to building the list with a for loop
end = time.perf_counter()
list_time = end - start
print(f"Python list time: {list_time:.5f} seconds")

# NumPy array (vectorized)
start = time.perf_counter()
np_result = np_array * 2
end = time.perf_counter()
numpy_time = end - start
print(f"NumPy array time: {numpy_time:.5f} seconds")

# Comparison
speedup = list_time / numpy_time if numpy_time > 0 else float('inf')
print(f"NumPy is about {speedup:.2f}x faster than list comprehension.")


Python list time: 0.34501 seconds
NumPy array time: 0.01750 seconds
NumPy is about 19.72x faster than list comprehension.
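A single timing can vary from run to run, which is why the cell above is worth running several times. As an alternative sketch, the standard-library `timeit` module can do the repetition for you. The names `demo_list` and `demo_array` and the smaller size are choices made for this example, not part of the cell above.

```python
import timeit
import numpy as np

size = 1_000_000  # smaller than above so repeated runs stay quick
demo_list = list(range(size))
demo_array = np.arange(size)

runs = 5  # how many times each statement is executed per measurement

# Take 3 measurements of `runs` executions each and keep the fastest total
list_best = min(timeit.repeat(lambda: [x * 2 for x in demo_list], number=runs, repeat=3))
array_best = min(timeit.repeat(lambda: demo_array * 2, number=runs, repeat=3))

print(f"Python list: {list_best / runs:.5f} seconds per run")
print(f"NumPy array: {array_best / runs:.5f} seconds per run")
```

Taking the minimum of several measurements reduces the effect of whatever else the machine was doing at the time. The exact numbers will differ between computers.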


## 🐼 Pandas DataFrame

A **DataFrame** is a table of data. It’s like a spreadsheet in Excel or Google Sheets, but in Python.

Each **row** is one record (like one person, one item, one event).  
Each **column** is a variable (like name, age, score, etc.).

You can:

- Create a DataFrame from a Python dictionary
- View the first few rows with `.head()`
- Access a column with `df["column_name"]`
- Get information about the data with `.info()` or `.describe()`

In [44]:
# Create a DataFrame
data = {
    "Student": ["Ava", "Ben", "Cara", "Dan"],
    "Grade": [9, 8, 9, 8],
    "Math_Score": [95, 88, 76, 90]
}

df = pd.DataFrame(data)

df

  Student  Grade  Math_Score
0     Ava      9          95
1     Ben      8          88
2    Cara      9          76
3     Dan      8          90


In [45]:
# View the first few rows
print(df.head())

  Student  Grade  Math_Score
0     Ava      9          95
1     Ben      8          88
2    Cara      9          76
3     Dan      8          90


In [49]:
# Access a single column
print("Math Scores:")
print(df["Math_Score"])

Math Scores:
0    95
1    88
2    76
3    90
Name: Math_Score, dtype: int64


In [50]:
# Get summary statistics
print("Description:")
print(df.describe())

Description:
         Grade  Math_Score
count  4.00000    4.000000
mean   8.50000   87.250000
std    0.57735    8.057088
min    8.00000   76.000000
25%    8.00000   85.000000
50%    8.50000   89.000000
75%    9.00000   91.250000
max    9.00000   95.000000
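The list at the start of this section also mentions `.info()`. As a sketch, the code below rebuilds the same small table under a separate name (`info_demo`, made up here so it does not overwrite `df`) and prints the column names, non-null counts, and data types that `.info()` reports.

```python
import pandas as pd

info_demo = pd.DataFrame({
    "Student": ["Ava", "Ben", "Cara", "Dan"],
    "Grade": [9, 8, 9, 8],
    "Math_Score": [95, 88, 76, 90],
})

# .info() prints its report directly (it returns None), so no print() is needed
info_demo.info()

# .dtypes gives just the data type of each column
print(info_demo.dtypes)
```

This also explains why the `.describe()` output above has no `Student` column: for a table with mixed column types, `.describe()` summarizes only the numeric columns by default.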


In [52]:
# Add a new column
df["Passed"] = df["Math_Score"] >= 80
df

  Student  Grade  Math_Score  Passed
0     Ava      9          95    True
1     Ben      8          88    True
2    Cara      9          76   False
3     Dan      8          90    True
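To see what happens in that cell, it can help to look at the comparison on its own: it produces a Series of `True`/`False` values, one per row. A sketch using a separate, made-up name (`passed_demo`) so the `df` above is left alone:

```python
import pandas as pd

passed_demo = pd.DataFrame({
    "Student": ["Ava", "Ben", "Cara", "Dan"],
    "Math_Score": [95, 88, 76, 90],
})

# The comparison is applied to every row and yields a boolean Series
mask = passed_demo["Math_Score"] >= 80
print(mask.tolist())   # [True, True, False, True]

# True counts as 1, so sum() gives the number of passing students
print(mask.sum())      # 3

# A boolean Series can also select only the rows where it is True
print(passed_demo[mask]["Student"].tolist())   # ['Ava', 'Ben', 'Dan']
```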


### Test Your Understanding

1. Create a DataFrame with this dictionary:

    ```python
    {
        "item": ["pencil", "notebook", "eraser"],
        "price": [1.5, 3.0, 0.75],
        "in_stock": [True, False, True]
    }
    ```

2. Use `.head()` to print the first few rows.
3. Add a new column called "discounted" that is `True` if the price is less than `2.0`.

In [54]:
# TODO: Create a dictionary called store with keys: item, price, in_stock
store = ...

# TODO: Create a DataFrame called df with the dictionary store
df = ...

In [None]:
# TODO: Call head on your data

In [None]:
# TODO: Add new column discounted that is True if the price < 2.0

In [58]:
#@title Test cases (DO NOT EDIT)
assert store == {
    "item": ["pencil", "notebook", "eraser"],
    "price": [1.5, 3.0, 0.75],
    "in_stock": [True, False, True]
}, "store dictionary is incorrect. Are dictionary keys lowercase?"
assert isinstance(df, pd.DataFrame), "df should be a DataFrame"
assert df.shape == (3, 4), f"Shape should be (3, 4). Was {df.shape}"
assert df["discounted"].tolist() == [True, False, True], "Discounted column values are incorrect"
assert "in_stock" in df.columns, "in_stock column missing"

AssertionError: store dictionary is incorrect. Are dictionary keys lowercase?

## Spotify Dataset

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = ""

# Load the latest version
df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "nelgiriyewithana/top-spotify-songs-2023",
  file_path,
  # Provide any additional arguments like
  # sql_query or pandas_kwargs. See the
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)
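
Once the dataset loads, a reasonable first step is the same kind of inspection used earlier in this notebook. The sketch below wraps those calls in a hypothetical helper, `first_look`, and tries it on a tiny stand-in table. The stand-in's column names and values are invented for illustration; they are not taken from the Spotify dataset.

```python
import pandas as pd

def first_look(frame: pd.DataFrame, n: int = 5) -> None:
    """Print the shape, column names, and first n rows of a DataFrame."""
    print("Shape (rows, columns):", frame.shape)
    print("Columns:", list(frame.columns))
    print(frame.head(n))

# Stand-in data (invented); replace stand_in with the df loaded above
stand_in = pd.DataFrame({"title": ["A", "B", "C"], "plays": [10, 20, 30]})
first_look(stand_in, n=2)
```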