# Tutorial 4: Numpy, Correlation and Exploratory Data Analysis

### Learning Goals:
After completing this notebook you will be able to

- Understand and manipulate NumPy arrays for numerical computations.  
- Perform sorting, reshaping, and aggregation operations using NumPy.  
- Analyze relationships between variables using correlation techniques.  
- Conduct basic exploratory data analysis (EDA) to derive insights from data.


### Numpy: Overview

- NumPy is built around the NumPy array (`ndarray`), which resembles a Python list but is faster and supports many more numerical operations.
- Unlike Python lists, NumPy arrays are homogeneous: all elements in an array must share the same data type.
- [Official documentation at numpy.org](https://numpy.org/doc/stable/user/index.html#user)
- [The absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)
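
To make the homogeneity point concrete, here is a minimal sketch (the variable names are made up for illustration). It shows that a list keeps each element's own type, while NumPy converts all values to one common data type when the array is created.

```python
import numpy as np

# A Python list can hold values of different types side by side.
mixed_list = [1, 2.5, 3]
print([type(v).__name__ for v in mixed_list])  # ['int', 'float', 'int']

# NumPy picks one common dtype for the whole array; here the integers are upcast to floats.
mixed_array = np.array(mixed_list)
print(mixed_array)        # [1.  2.5 3. ]
print(mixed_array.dtype)  # float64
```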


In [1]:
#import numpy
import numpy as np

The NumPy array is the central data structure of the NumPy library. To create a NumPy array, you can use the function **np.array()**.

In [2]:
# python list
list_l1 = [0, 1, 2, 3]

# numpy array created from a list
array_n = np.array(list_l1)

print("List:", list_l1)
print("Numpy array:", array_n)

print(type(list_l1))
print(type(array_n))

List: [0, 1, 2, 3]
Numpy array: [0 1 2 3]
<class 'list'>
<class 'numpy.ndarray'>


In [3]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("Numpy array:\n\n", a)

# We can access elements using square brackets. On a 2D array, a single index returns a whole row.
print("\nFirst element of the array")
print(a[0])

Numpy array:

 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

First element of the array
[1 2 3 4]


### Array vs Lists

- Arrays and lists may look similar, but operations on NumPy arrays are usually much faster than the equivalent loops over lists. The example below times doubling one million values both ways.


In [4]:
import numpy as np
import time
from datetime import datetime


# With NumPy
arr = np.arange(1000000) #creates a numpy array with values from 0 to 999,999
start = time.time() # the time in seconds since epoch (January 1, 1970 , 00:00:00 UTC), so this is current time
#dt = datetime.fromtimestamp(start)  # current time in human readable format
#print(dt)
arr2 = arr * 2
print("NumPy execution time:", time.time() - start)

# With List
lst = list(range(1000000))
start = time.time()
lst2 = [x * 2 for x in lst]
print("List execution time:", time.time() - start)


NumPy execution time: 0.0035619735717773438
List execution time: 0.036775827407836914


- NumPy supports vectorized operations, so you can operate on entire arrays without writing explicit loops.

In [5]:
# NumPy Array
arr = np.array([1, 2, 3])
arr * 2  # array([2, 4, 6])


array([2, 4, 6])

In [6]:
# Python List
lst = [1, 2, 3]
[x * 2 for x in lst]  # Slower and more verbose


[2, 4, 6]

### Built in functions

Besides creating an array from a sequence of elements, you can easily create an array filled with 0's or 1's, or an identity matrix:

In [None]:
print(np.zeros((2,3)))        # a matrix of zeros
print(np.ones((3,2)))         # a matrix of ones
print(np.eye(3))              # identity matrix

### Sequence and random values

You can create an array with a range of elements, either randomly or as a sequence:

In [7]:
# Sequence from 0 to 3
print(np.arange(4))

# Sequence from 2 to 9 with a step of 2
print(np.arange(2, 9, 2))



[0 1 2 3]
[2 4 6 8]
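
The cell above covers sequences only. As a hedged sketch of the random case, the block below uses NumPy's `default_rng` generator. The seed value and the variable names are arbitrary choices for illustration, and `np.linspace` is included as another common way to build a sequence.

```python
import numpy as np

# Seeded generator so the same numbers come out on every run (the seed 42 is arbitrary).
rng = np.random.default_rng(42)

uniform = rng.random(3)                 # 3 floats drawn uniformly from [0, 1)
integers = rng.integers(0, 10, size=5)  # 5 integers from 0 to 9 (10 is excluded)
evenly_spaced = np.linspace(0, 1, 5)    # 5 evenly spaced values from 0 to 1, endpoint included

print(uniform)
print(integers)
print(evenly_spaced)  # [0.   0.25 0.5  0.75 1.  ]
```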


#### Task 1 on Sorting array

#### 📝 Instructions:

1. Create a 1D NumPy array with the following values:

   [2, 1, 5, 3, 7, 4, 6, 8]

   <br>
   

2. Sort the array in ascending order.

3. Sort the array in descending order.

#### Task 2 on Sorting Array

#### 📝 Instructions:

1. Create a 2D NumPy array with 3 rows and 3 columns as follows:
   ```python
   [[3, 1, 4],
    [1, 5, 9],
    [2, 6, 5]]
   ```

2. Sort each row of the array. Print the sorted array. The expected output is as follows:

[[1 3 4]
 [1 5 9]
 [2 5 6]]



3. Sort all elements globally (regardless of the row or column). Print the sorted array in 2-D. The expected output is as follows:


[[1 1 2]
 [3 4 5]
 [5 6 9]]
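
As a hint for the two sorting tasks, here is a small sketch of `np.sort` on made-up arrays (deliberately not the task data). It is one possible approach rather than the only one. The `axis` argument controls whether sorting happens within rows or over the flattened array.

```python
import numpy as np

values = np.array([40, 10, 30, 20])   # made-up values, not the task data

ascending = np.sort(values)           # returns a sorted copy; `values` itself is unchanged
descending = np.sort(values)[::-1]    # reverse the ascending result

grid = np.array([[9, 7, 8],
                 [3, 1, 2]])
rows_sorted = np.sort(grid, axis=1)   # sort within each row
# axis=None flattens the array before sorting; reshape restores the 2D layout.
all_sorted = np.sort(grid, axis=None).reshape(grid.shape)

print(ascending)    # [10 20 30 40]
print(descending)   # [40 30 20 10]
print(rows_sorted)
print(all_sorted)
```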

### Concatenation

In [8]:
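# Join two 1D arrays end to end
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
c = np.concatenate((a, b))
print(c)

# Join two 2D arrays along axis 0: the row in y is appended below the rows of x
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
z = np.concatenate((x, y), axis=0)
print(z)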

[1 2 3 4 5 6 7 8]
[[1 2]
 [3 4]
 [5 6]]


#### Array Operations

In [9]:
a = np.array([1,2,3,4,5,6])
b = np.array([2,4,6,8, 10, 12])
print(a + b)
print(a * b)

[ 3  6  9 12 15 18]
[ 2  8 18 32 50 72]


### Dimensions

**How do you know the shape and size of an array?**

- **ndarray.ndim**: number of axes, or dimensions, of the array
- **ndarray.size**: total number of elements of the array
- **ndarray.shape**: a tuple of integers giving the number of elements stored along each dimension of the array

In [10]:
# Create a 2D NumPy array (2 rows, 3 columns)
# In NumPy, the number of dimensions is called "ndim", and each dimension is called an "axis"
k = np.array([[0, 0, 0], [1, 1, 1]])

# Print the number of dimensions (axes) — should be 2 (i.e., it's a 2D array)
print(k.ndim)  # Output: 2

# Print the total number of elements in the array (2 rows × 3 columns = 6)
print(k.size)  # Output: 6

# Print the shape of the array: (2, 3) # 2 rows and 3 columns
# → 2 elements along the first axis (rows), 3 elements along the second axis (columns)
print(k.shape)  # Output: (2, 3)

print(k)

2
6
(2, 3)
[[0 0 0]
 [1 1 1]]


#### Task 3 on Array Shape:

Create an array that has the shape (3, 2, 4). What does each value in the shape mean in this case? Please explain.

**Array Dimension Conversions**

How can we add a new axis to an array? For instance, how can we convert a 1D array into a 2D array?


We can use the `reshape` method to change the number of rows and columns of an array.\
When reshaping, the original array and the reshaped array must have the same total number of elements.
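
The rule above can be sketched on a small made-up array as follows. Besides `reshape`, the block also uses `np.newaxis`, which is another standard way to add an axis. The array and the variable names are illustrative only.

```python
import numpy as np

flat = np.arange(6)            # [0 1 2 3 4 5], shape (6,)

grid = flat.reshape(2, 3)      # 2 rows x 3 columns = 6 elements, so this is valid
auto = flat.reshape(3, -1)     # -1 lets NumPy infer the missing dimension (here 2)
row = flat[np.newaxis, :]      # add a new leading axis: shape (1, 6)
column = flat[:, np.newaxis]   # add a new trailing axis: shape (6, 1)

print(grid.shape, auto.shape, row.shape, column.shape)  # (2, 3) (3, 2) (1, 6) (6, 1)

# A mismatch in the number of elements raises a ValueError.
try:
    flat.reshape(4, 2)         # 4 * 2 = 8 elements requested, but only 6 exist
except ValueError as err:
    print("Cannot reshape:", err)
```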

#### Task 4: Reshape a 1D NumPy Array to 2D

- Convert the following 1D array into a 2D NumPy array with 2 rows and 5 columns.

  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

####  Question:
You are given a 1D NumPy array with 10 elements. Which of the following reshape dimensions are **valid**?

---

####  Choices (Select all that apply):

- [ ] (2, 5)
- [ ] (3, 3)
- [ ] (10, 1)
- [ ] (2, 1, 5)
- [ ] (2, 2, 2)


**Numpy Array Creation from Existing Data**

In [11]:
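# Convert a list to a NumPy array
l = [1,2,3,4]
npa = np.asarray(l)
print(npa)
# Convert a list of tuples that have different lengths
l = [(1,2,3),(4,5)]
# By default, NumPy infers the data type of the result array from the input data.
# These tuples have different lengths, so they cannot form a regular 2D numeric array;
# dtype=object stores each tuple as a single element instead.
npa = np.asarray(l, dtype=object)
print(npa)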

[1 2 3 4]
[(1, 2, 3) (4, 5)]


In [12]:
s = "0123456"
#Creates a new 1-dimensional array from an iterable object
npa = np.fromiter(s, dtype=int)
print(npa)

[0 1 2 3 4 5 6]


### How can we measure or quantify the relationship between two variables?


#### Covariance

Covariance is a statistical measure that describes the relationship between two variables.

- It indicates the **direction** of the linear relationship (positive or negative) between the variables.
- The variables can have **different units of measurement**
- Covariance estimates the **extent to which the two variables change together**:
  - A **positive covariance** means that as one variable increases, the other tends to increase as well.
  - A **negative covariance** means that as one variable increases, the other tends to decrease.
- It helps determine whether variables **move in the same or opposite direction**.

> ⚠️ However, covariance **does not indicate the strength** (degree) of the relationship, because its magnitude depends on the units of the variables.  
> For that, the **correlation coefficient** is a better measure.


The calculation of the sample covariance is as follows:

$$
\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
$$
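
To connect the formula to code, the sketch below computes the sample covariance term by term on made-up data and compares it with `np.cov`. The arrays `x` and `y` are illustrative and are not part of the tutorial's data.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)
# Sum of products of deviations from the means, divided by n - 1 (sample covariance).
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual)  # 1.5
print(np.isclose(cov_manual, cov_numpy))  # True
```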


#### What is Correlation?

**Correlation** refers to a statistical relationship between two variables — how they move or change in relation to one another.

Understanding correlation is important in data analysis and modeling because it helps identify potential connections between variables. These connections can exist for various reasons, such as:

- One variable may directly **influence or depend on** another.
- The variables might have a **weak or moderate association** without direct causation.
- Both variables could be **influenced by a third, unobserved factor**.

By examining correlations, we gain insights into the structure of the data, which can guide decisions in feature selection and interpretation of results.



- Correlation measures the **direction and strength** of the relationship between variables.
- The measure of correlation is called the **correlation coefficient**.
- The **correlation coefficient** (also known as **Pearson’s correlation coefficient**, denoted as *r*) measures the **strength of the linear relationship** between two continuous variables.
- The **degree of relationship** is expressed by a value called *r*, which satisfies:

  $$
  -1 \leq r \leq 1
  $$

  - $r = 1$: Perfect positive linear relationship  
  - $r = -1$: Perfect negative linear relationship  
  - $r = 0$: No linear relationship


> **The correlation coefficient `r` is equal to the covariance of `X` and `Y`, divided by the product of the standard deviations of `X` and `Y`.**


$$
r = \frac{\text{cov}(X, Y)}{S_X S_Y}
$$
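
As a rough check of this formula, the sketch below computes $r$ from the sample covariance and the sample standard deviations (`ddof=1`) on made-up data, then compares it with `np.corrcoef`. It also rescales `x` to illustrate that the covariance changes with the units while $r$ does not. All names and values are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]          # sample covariance (divides by n - 1)
s_x = np.std(x, ddof=1)              # sample standard deviations, matching the n - 1 above
s_y = np.std(y, ddof=1)

r_manual = cov_xy / (s_x * s_y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))  # 0.7746 0.7746

# Changing the units of x (e.g. metres to centimetres) scales the covariance but not r.
x_cm = 100 * x
print(np.cov(x_cm, y)[0, 1])                  # 100 times larger than cov_xy
print(round(np.corrcoef(x_cm, y)[0, 1], 4))   # 0.7746
```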

<br>


Let's have a look at an example below and compute covariance and correlation between two `numpy` arrays.

In [13]:
import numpy as np
# Sample data
# Having X and Y variables
X = np.array([7, 8, 3, 6, 57, 11, 15 ])
Y = np.arange(7)


cov_matrix = np.cov(X, Y)  
print(cov_matrix)
cov_xy = cov_matrix[0, 1] 
print("\nCovariance:", cov_xy) #cov_matrix[0, 1] (or [1, 0]) is the actual covariance between X and Y



[[352.9047619   14.        ]
 [ 14.           4.66666667]]

Covariance: 14.0


⚠️ **Note:** The covariance is the value that appears twice in the matrix (the two off-diagonal entries). This is because `np.cov` returns the full covariance matrix, whose entries are $cov(x,x)$, $cov(x,y)$, $cov(y,x)$, and $cov(y,y)$, read row by row. We only need $cov(x,y)$, which is always equal to $cov(y,x)$. The diagonal entries $cov(x,x)$ and $cov(y,y)$ are the variances of $x$ and $y$.

In [14]:
# For teachers: in case someone asks whether cov(X, X) equals var(X), the answer is yes.
# Pay attention to ddof: np.var() computes the population variance by default (ddof=0),
# whereas np.cov() computes the sample covariance by default (dividing by n - 1).
# To compare the two, the variance must also be the sample variance, so set ddof=1.

print(np.var(X, ddof=1)) # sample variance of X, matches cov_matrix[0, 0]
print(np.var(Y, ddof=1)) # sample variance of Y, matches cov_matrix[1, 1]

352.90476190476187
4.666666666666667


In [15]:
corr_matrix = np.corrcoef(X, Y)
print(corr_matrix)
corr_xy = corr_matrix[0, 1]
print("Correlation coefficient (r):", corr_xy)

[[1.         0.34498156]
 [0.34498156 1.        ]]
Correlation coefficient (r): 0.3449815633402216


#### Task 5

Below is a toy dataset of Dutch cities listing the number of sports facilities and educational facilities in each. This small dataset is well suited to practising some statistical concepts. The objective of this task is to calculate and interpret the covariance and the correlation coefficient.


| Dutch City    | Sports Facilities | Educational Facilities |
|:-------------:|:------------------:|:--------------------:|
| Leiden        | 23                 | 16                   |
| Maastricht    | 25                 | 26                   |
| Haarlem       | 17                 | 12                   |
| Nijmegen      | 38                 | 32                   |
| Eindhoven     | 40                 | 35                   |
| Tilburg       | 30                 | 26                   |
| Groningen     | 36                 | 36                   |
| Den Bosch     | 29                 | 32                   |
| Scheveningen  | 15                 | 8                    |
| Venlo         | 11                 | 14                   |


#### Use the above data to compute i) the covariance and ii) the correlation between the two variables (the number of sports facilities and the number of educational facilities).

#### Create a scatter plot of the number of sports facilities vs the number of educational facilities.



### Exploring the Wine Quality Dataset

We are going to load the Wine Quality Dataset from here: https://archive.ics.uci.edu/dataset/186/wine+quality 

Please study the variables present in this dataset (see the above url). 

The target variable is `quality`, which refers to the quality of the wine and is a score between 0 and 10. 

Most other variables can be used as features to predict the quality. However, before we start developing prediction models, we should gain a better understanding of the `features` and their relationship with the `target`. For this, we will utilize the **correlation coefficient**.

### How to load the Wine Quality Dataset?

- Go to: https://archive.ics.uci.edu/dataset/186/wine+quality
- Click on `Import in Python` and follow the instructions.

I have done the same below.


In [16]:
!pip install ucimlrepo # Install the ucimlrepo package



In [17]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
wine_quality = fetch_ucirepo(id=186) 
  
# data (as pandas dataframes) 
X = wine_quality.data.features # X has features
y = wine_quality.data.targets  # y has target

In [18]:
# Combine X and y into a single DataFrame df so we can compute the correlation matrix of all variables together
df = X.copy()              # Start with feature DataFrame
df['quality'] = y          # Add quality as a new column
df

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6492 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 6493 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 6494 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 6495 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |


#### Correlation Matrix and Heatmap

- A **correlation matrix** is a table that shows the **linear relationship** (correlation) between each pair of numerical variables in your dataset.
- It helps us quickly understand which features are:
  - Strongly related to each other
  - Potentially redundant (e.g., if two variables are highly correlated)
  - Most related to the **target variable** (useful for prediction)
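
To illustrate how such a matrix is read, here is a minimal NumPy-only sketch on hypothetical toy data. The column names `feature_a`, `feature_b` and `target` and all values are invented for illustration. It builds the correlation matrix with `np.corrcoef` and ranks the features by the absolute value of their correlation with the target.

```python
import numpy as np

# Hypothetical toy data: rows are observations, columns are [feature_a, feature_b, target].
data = np.array([
    [1.0, 9.0, 2.1],
    [2.0, 7.5, 3.9],
    [3.0, 6.0, 6.2],
    [4.0, 4.0, 8.1],
    [5.0, 2.5, 9.8],
])
names = ["feature_a", "feature_b", "target"]

# rowvar=False tells NumPy that each column (not each row) is a variable.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))

# The last column holds each variable's correlation with the target.
target_corr = corr[:-1, -1]
order = np.argsort(-np.abs(target_corr))   # strongest association first, ignoring sign
for i in order:
    print(f"{names[i]}: {target_corr[i]:+.2f}")
```

Ranking by absolute value matters here because a strong negative correlation is just as informative for prediction as a strong positive one.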

### 🌡️ Heatmap Visualization

- A **heatmap** is a graphical representation of the correlation matrix.
- It uses **color intensity** to show the strength and direction of correlations.
  - With the `coolwarm` colormap used here (other colormaps can be chosen): 
    - 🔴 Red = Strong positive correlation
    - 🔵 Blue = Strong negative correlation
    - ⚪ Light/white = Weak or no correlation
- It includes **annotated numbers** to show the exact correlation values.
- Heatmaps make it **easy to spot patterns** at a glance.

In [19]:
# Compute correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns # you might need to install the seaborn library

corr = df.corr()

# Plot heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix:  Wine Quality Dataset")
plt.tight_layout()
plt.show()


#### Interpretation of the Heatmap

Take a close look at the **correlation heatmap** for the Wine Quality dataset.

### 🔍 Task 6:

1. **Identify the features** that are most strongly correlated with the target variable `quality`.
2. Which feature has the **highest positive correlation** with `quality`?
3. Are there any **features that are strongly correlated with each other** (not including `quality`)?
4. Do you notice any other interesting patterns?

### Exploratory Data Analysis on a Crime Dataset

The dataset consists of observations of the crime rate in North Carolina for the year 1987. North Carolina is divided into counties, and the data is aggregated by county. The dataset is available here: https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/crime_1987.csv .

The dataset was taken from the data accompanying this paper (http://qed.econ.queensu.ca/jae/datasets/baltagi003/), keeping only the observations for the year 1987. 

A brief description of various variables in the dataset is presented below:

| Variable | Description |
|:---------|:------------|
| `county` | county identifier |
| `year` | year (1987) |
| `crmrte` | crimes committed per person |
| `prbarr` | 'probability' of arrest |
| `prbconv` | 'probability' of conviction |
| `prbpris` | 'probability' of prison sentence |
| `avgsen` | average sentence, in days |
| `polpc` | police per capita |
| `density` | hundreds of people per square mile |
| `taxpc` | tax revenue per capita |
| `west` | 1 if the county is in the western region of the state |
| `central` | 1 if the county is in the central region of the state |
| `urban` | 1 if the county is in an SMSA (Standard Metropolitan Statistical Area) |
| `pctmin80` | percentage minority in 1980 |
| `wcon` | weekly wage in construction |
| `wtuc` | weekly wage in transportation, utilities, and communications |
| `wtrd` | weekly wage in wholesale and retail trade |
| `wfir` | weekly wage in finance, insurance and real estate |
| `wser` | weekly wage in the service industry |
| `wmfg` | weekly wage in manufacturing |
| `wfed` | weekly wage of federal employees |
| `wsta` | weekly wage of state employees |
| `wloc` | weekly wage of local government employees |
| `mix` | offense mix: face-to-face/other |
| `pctymle` | percentage of young males |

<br>


#### Task 7

The target (or dependent) variable of interest is the crime rate, represented by `crmrte`. In the future, we would like to use a machine learning algorithm such as <strong>Linear Regression</strong> to predict `crmrte`. Your goal is to find the key variables (features) with the strongest association with `crmrte`. This is known as the feature selection phase in machine learning: you want to keep only the most relevant features for predicting the value of `crmrte`. Therefore, you must:


- Perform an exploratory numerical and visual data analysis of the variables.
- Explore and plot the relationships between the various independent variables and the target variable, `crmrte`.

Once you’ve completed your analysis, summarize:
- Which features could be most relevant for predicting `crmrte`.
- Which variables show problematic outliers.
- For which variables it might be justified to keep the outliers (e.g., because they represent real-world extreme cases rather than errors).
- Any insights into the relationships between variables.


Hints: 

You can perform a correlation analysis to find the variables with a strong association with the target `crmrte`. You can also use scatter plots to assess the relationship between the features and the target. 

Then you can focus on plots such as boxplots (to check for outliers) and/or histograms (to inspect the data distribution). 

You are encouraged to go beyond the suggested methods. Feel free to explore other plots or techniques, as this is an open-ended analysis.
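
As a numerical companion to the boxplot hint, the sketch below flags outliers with the 1.5 × IQR rule that boxplots conventionally use to place their whiskers. The values are made up for illustration and are not taken from the crime dataset. This is one common convention, not the only way to define an outlier.

```python
import numpy as np

# Made-up values; 45.0 is deliberately far from the rest.
values = np.array([12.0, 14.0, 15.0, 15.5, 16.0, 17.0, 18.0, 45.0])

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range

lower = q1 - 1.5 * iqr                     # points below this are flagged
upper = q3 + 1.5 * iqr                     # points above this are flagged

outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)  # [45.]
```

Whether a flagged point should be dropped is a separate judgment: it may be a data error, or a genuine extreme case worth keeping.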
