# COSC 311: Introduction to Data Visualization and Interpretation

Instructor: Dr. Shuangquan (Peter) Wang

Email: spwang@salisbury.edu

Department of Computer Science, Salisbury University


# Module 2_Data Processing and Organization

## 2. Numerical tools



**Contents of this note refer to 1) Dr. Joe Anderson's teaching materials; 2) textbook "Data Science from Scratch"; 3) https://www.w3schools.com/python/numpy/default.asp**

**<font color=red>All rights reserved. Dissemination or sale of any part of this note is NOT permitted.</font>**

# NumPy

https://numpy.org/

NumPy stands for Numerical Python. NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays (https://en.wikipedia.org/wiki/NumPy).

Functions in NumPy: https://numpy.org/doc/stable/reference/routines.html

Example:

numpy.arange([start, ]stop, [step, ]dtype=None, *, like=None)

- return evenly spaced values within a given interval.
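A minimal sketch of how `numpy.arange` behaves, assuming only the signature quoted above. The variable names `a` and `b` are illustrative. Note that `stop` is excluded from the result.

```python
import numpy as np

# With a single argument, arange counts from 0 up to (but not including) stop.
a = np.arange(5)

# start=2, stop=10 (exclusive), step=2
b = np.arange(2, 10, 2)

print(a)  # [0 1 2 3 4]
print(b)  # [2 4 6 8]
```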

## Create arrays using NumPy

Refer to https://www.w3schools.com/python/numpy/numpy_creating_arrays.asp

NumPy is used to work with arrays. The array object in NumPy is called *ndarray*. A NumPy *ndarray* object can be created by using the **array()** function.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)

print(type(arr))

array() function can convert a list, tuple or any array-like object into an array.

In [None]:
# e.g. tuple

arr = np.array((1, 2, 3, 4, 5))
print(arr)

## 2-D Arrays

A 2-D array is an array that has 1-D arrays as its elements. 2-D arrays are often used to represent matrices.

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)

Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:

In [None]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)

NumPy arrays have the **ndim** attribute, an integer that gives the number of dimensions of the array.

In [None]:
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])

print(b.ndim)
print(c.ndim)
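As a small complement, the sketch below checks **ndim** on a 0-D and a 3-D array as well. The `shape` attribute, which is not covered above, is included only to show how the dimension count relates to the length along each axis. All variable names are illustrative.

```python
import numpy as np

scalar = np.array(42)                       # 0-D array
vector = np.array([1, 2, 3])                # 1-D array
cube = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[1, 2, 3], [4, 5, 6]]])   # 3-D array

print(scalar.ndim, vector.ndim, cube.ndim)  # 0 1 3

# shape has one entry per dimension, so len(shape) equals ndim
print(cube.shape)                           # (2, 2, 3)
```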


## Access Array Elements

You can access an array element by referring to its index number.
To access elements from 2-D arrays you can use comma separated indices for row and column.

In [None]:
arr = np.array([1, 2, 3, 4])
print(arr[0])

In [None]:
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])

Iterating arrays

In [None]:
arr = np.array([1, 2, 3])

for x in arr:
  print(x)

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
  print(x)

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
  for y in x:
    print(y)

## Sorting Arrays

NumPy provides the function np.sort(), which returns a sorted copy of the given array. (The ndarray method sort() instead sorts the array in place.)

In [None]:
arr = np.array([3, 2, 0, 1])

print(np.sort(arr))
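A short sketch contrasting the two ways of sorting, using the same values as the cell above: `np.sort(arr)` leaves `arr` untouched, while the method `arr.sort()` modifies it in place and returns `None`. The name `sorted_copy` is illustrative.

```python
import numpy as np

arr = np.array([3, 2, 0, 1])

sorted_copy = np.sort(arr)   # returns a new, sorted array
print(sorted_copy)           # [0 1 2 3]
print(arr)                   # [3 2 0 1]  (original is unchanged)

result = arr.sort()          # sorts arr in place
print(result)                # None
print(arr)                   # [0 1 2 3]
```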

## Calculation between two arrays

In [None]:
x = np.array([1,2,3,4])
y = np.array([4,5,6,7])

In [None]:
# array + array
print(x + y)

In [None]:
# array + number
print(x + 2)

This differs from list arithmetic, where + concatenates the two lists:

In [None]:
# list + list
x = [1,2,3,4]
y = [4,5,6,7]
z = x + y
print(z)

In [None]:
# list + number (raises a TypeError at run time; it is not a syntax error)
print(x + 2)
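The sketch below shows one way to observe that error without stopping the notebook, and two ways to get the element-wise result. The names `message` and `added` are illustrative.

```python
import numpy as np

x = [1, 2, 3, 4]

message = None
try:
    x + 2                      # a list can only be concatenated with another list
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# With plain lists, element-wise addition needs an explicit loop or comprehension.
added = [v + 2 for v in x]
print(added)                   # [3, 4, 5, 6]

# Converting to an ndarray gives element-wise arithmetic directly.
print(np.array(x) + 2)         # [3 4 5 6]
```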

Other array operations:

In [None]:
x = np.array([1,2,3,4])

np.sqrt(x)

In [None]:
x = np.array([1,2,3,4])

x * 10

Array * Array is element-wise multiplication

In [None]:
A = np.array([
    [2,3,4],
    [3,1,0],
    [0,1,1]
])

x = np.array([1,3,4])

In [None]:
A * x  # will broadcast the 1x3 into a 3x3 by vertically copying,
       # then does element-wise multiplication
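To make the broadcasting step concrete, here is a minimal sketch with the same `A` and `x`. It builds the stacked 3x3 copy of `x` explicitly with `np.tile` (used here only for illustration) and checks that it gives the same product.

```python
import numpy as np

A = np.array([
    [2, 3, 4],
    [3, 1, 0],
    [0, 1, 1]
])
x = np.array([1, 3, 4])

product = A * x                  # x is broadcast across every row of A
print(product)
# [[ 2  9 16]
#  [ 3  3  0]
#  [ 0  3  4]]

stacked = np.tile(x, (3, 1))     # three copies of x stacked as rows
manual = A * stacked             # plain element-wise product of two 3x3 arrays
print(np.array_equal(product, manual))  # True
```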

How do we perform true matrix multiplication (matrix times matrix)?

In [None]:
A = np.arange(1,5).reshape(2,2)
print(A)
B = np.arange(0,4).reshape(2,2)
print(B)
print(type(A))
print(type(B))

In [None]:
A * B   # array multiplication: element-wise multiplication

In [None]:
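# np.asmatrix wraps an array as an np.matrix, whose * operator is matrix multiplication.
# (The older alias np.mat was removed in NumPy 2.0.)
C = np.asmatrix(A)
print(C)
D = np.asmatrix(B)
print(D)
print(type(C))
print(type(D))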

In [None]:
# matrix multiplication
C * D
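As an alternative sketch that stays with ordinary ndarrays, the `@` operator (equivalently `np.matmul`) performs matrix multiplication without converting to the matrix type. The same 2x2 values as above are used. The names `elementwise` and `matrix_product` are illustrative.

```python
import numpy as np

A = np.arange(1, 5).reshape(2, 2)   # [[1 2], [3 4]]
B = np.arange(0, 4).reshape(2, 2)   # [[0 1], [2 3]]

elementwise = A * B                 # multiplies matching entries
matrix_product = A @ B              # rows of A times columns of B

print(elementwise)
# [[ 0  2]
#  [ 6 12]]
print(matrix_product)
# [[ 4  7]
#  [ 8 15]]
```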

# pandas

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. -- Wikipedia

In [None]:
import pandas as pd # our main priority here
import numpy as np  # just in case, but also comes along with pandas
from matplotlib import pyplot as plt # for plotting

In [None]:
# dictionary that holds the wins, draws, and losses for three 
# football (soccer) teams over three years. This is realized 
# in a simple way by just doing nine rows, where each row has
# the year, team, wins, draws, and losses.

football_data = { 
    'year': [
        2010, 2011, 2012,
        2010, 2011, 2012,
        2010, 2011, 2012
        ],
    'team': [
        'FCBarcelona', 'FCBarcelona',
        'FCBarcelona', 'RMadrid',
        'RMadrid', 'RMadrid',
        'ValenciaCF', 'ValenciaCF',
        'ValenciaCF'
    ],
    'wins':[30 , 28 , 32 , 29 , 32 , 26 , 21 , 17 , 19] ,
    'draws': [6 , 7 , 4 , 5 , 4 , 7 , 8 , 10 , 8] ,
    'losses': [2 , 3 , 2 , 4 , 2 , 5 , 9 , 11 , 11]
}

# The task: now we want to manipulate and present this data

In [None]:
# how many wins did FCB have in 2010?
print(football_data['wins'][0]) 

# how many wins did RMadrid have in 2011?
print(football_data['wins'][4])

print(football_data)

The main object of interest here is the DataFrame. See the pandas [User Guide](https://pandas.pydata.org/docs/user_guide/index.html) and the DataFrame API reference:

https://pandas.pydata.org/docs/reference/frame.html

In [None]:
#football_df = pd.DataFrame(football_data, columns=['year', 'team', 'wins', 'draws', 'losses'])
football_df = pd.DataFrame(football_data)
print(football_df) # show the object as plain text

In [None]:
football_df

In [None]:
football_df.head(2)

In [None]:
football_df[2:] # we can slice similar to python lists!
                # gets every row from index 2 onward

In [None]:
football_df['team'] # get a whole column

In [None]:
football_df[ ['team', 'wins', 'draws'] ] # get three whole columns

In [None]:
# add a new column
football_df['m_index'] = np.array(list(football_df.index)) + 1
football_df

In [None]:
# run some basic statistics on the numerical columns in the dataframe
football_df.describe()

In [None]:
# basic stats operations can be run over dataframes
means = football_df[['wins', 'draws']].mean() # gives a dictionary-like result object
print(means)
type(means) # pandas.core.series.Series

In [None]:
# selecting with a one-element list still gives a DataFrame, so mean() returns a Series
means = football_df[['wins']].mean()
print(means)
type(means)

In [None]:
means = football_df[['draws']].mean()
print(means)
type(means)

In [None]:
# pay attention to the difference
means = football_df['wins'].mean()
print(means)
type(means)

Why?

In [None]:
type(football_df[['wins']])

In [None]:
type(football_df['wins'])
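The difference can be reproduced on a small stand-alone frame. This is a sketch with a few of the win and draw counts from above. The names `df`, `col_series` and `col_frame` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'wins': [30, 28, 32], 'draws': [6, 7, 4]})

col_series = df['wins']     # single label  -> Series
col_frame = df[['wins']]    # list of labels -> DataFrame (here with one column)

print(type(col_series).__name__)   # Series
print(type(col_frame).__name__)    # DataFrame

mean_scalar = col_series.mean()    # one number: 30.0
mean_series = col_frame.mean()     # a Series with one entry, labelled 'wins'

print(mean_scalar)
print(mean_series)
```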

In [None]:
# extract all the RM data:
football_df['team'] == 'RMadrid' 
# results in a boolean Series (one True/False per row), produced by element-wise comparison

print(football_df['team'] == 'RMadrid')


In [None]:
# we can select rows by using boolean indexing
# so combining these ideas:
football_df[ football_df['team'] == 'RMadrid' ]

In [None]:
# boolean Series can also be combined with bitwise operators
print(football_df['team'] == 'RMadrid')
print(football_df['team'] == 'ValenciaCF')

# use bitwise OR (the single pipe, |) to combine:
print( (football_df['team'] == 'RMadrid') | (football_df['team'] == 'ValenciaCF') )

In [None]:
football_df[ (football_df['team'] == 'RMadrid') | (football_df['team'] == 'ValenciaCF') ]
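A minimal sketch of the same selection on a small stand-alone frame. Each comparison needs its own parentheses, because `|` binds more tightly than `==`. `Series.isin` is shown as an alternative that is not used in these notes, and it gives the same mask. The frame `df` and the mask names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['FCBarcelona', 'RMadrid', 'ValenciaCF', 'RMadrid'],
    'wins': [30, 29, 21, 32],
})

# Parentheses around each comparison are required.
mask_or = (df['team'] == 'RMadrid') | (df['team'] == 'ValenciaCF')

# isin tests membership in a list of values in one call.
mask_isin = df['team'].isin(['RMadrid', 'ValenciaCF'])

print(mask_or.equals(mask_isin))   # True
print(df[mask_isin])
```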

In [None]:
football_df[football_df['team'] == 'FCBarcelona']

In [None]:
# DataFrames have built-in plotting!
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html

football_df[football_df['team'] == 'FCBarcelona'].plot.bar(x='year', y='wins', 
   color=['b', 'r', 'g'], 
   title='FC Barcelona Wins from 2010 to 2012',
   xlabel='Year',
   ylabel='Wins'
   )

In [None]:
football_df[(football_df['team'] == 'FCBarcelona') | (football_df['team'] == 'ValenciaCF')].plot.bar(x='year', y='wins', 
   color=['b', 'r', 'g'], 
   title='Barcelona & Valencia Wins from 2010 to 2012',
   xlabel='Year',
   ylabel='Wins'
   )

In [None]:
fcb_vcf_wins = football_df[(football_df['team'] == 'FCBarcelona') \
                           | (football_df['team'] == 'ValenciaCF')][['year','team','wins']]
fcb_vcf_wins

In [None]:
# DataFrame.pivot: Return reshaped DataFrame organized by given index / column values.
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html

fcb_vcf_wins_pivoted = fcb_vcf_wins.pivot(index='year', columns='team', values='wins')
fcb_vcf_wins_pivoted
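To see what the pivot does in isolation, here is a small sketch on a four-row subset of the same data. Each distinct `year` becomes a row, each distinct `team` becomes a column, and the cells hold `wins`. The names `long_df` and `wide_df` are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame({
    'year': [2010, 2011, 2010, 2011],
    'team': ['FCBarcelona', 'FCBarcelona', 'ValenciaCF', 'ValenciaCF'],
    'wins': [30, 28, 21, 17],
})

# One row per year, one column per team
wide_df = long_df.pivot(index='year', columns='team', values='wins')
print(wide_df)

# Individual cells are addressed by (row label, column label)
print(wide_df.loc[2011, 'ValenciaCF'])   # 17
```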

In [None]:
fcb_vcf_wins_pivoted.plot(kind='bar')

Use a CSV file as the data source

In [None]:
# The file path is ***RELATIVE TO THIS NOTEBOOK***
# https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
housing_data = pd.read_csv('housing.csv')
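If `housing.csv` is not at hand, `read_csv` can be tried on an in-memory text buffer instead. The sketch below assumes a tiny CSV whose column names are borrowed from this note. The three data rows are invented for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Invented sample rows; only the column names mirror the housing data.
csv_text = """longitude,latitude,median_house_value,ocean_proximity
-122.0,37.0,300000,near
-118.0,34.0,200000,near
-120.0,36.0,100000,far
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (3, 4)
print(sample.dtypes)   # numeric columns are inferred automatically
```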

In [None]:
housing_data.info()

In [None]:
housing_data['ocean_proximity']

In [None]:
housing_data['ocean_proximity'].unique()

# Using and Analyzing Data

## Questions that the housing data could potentially answer:

- What is the average number of bedrooms in houses?
- What is the average distance from the ocean?
- How does income/house value correspond to ocean proximity?
- Does location (ocean proximity) affect size of house or number of bedrooms?
- Breakdown of attributes per-state (size, income, etc) using lat/long
- Where do people retire?
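As one possible starting point for the house value versus ocean proximity question, a group-wise mean can be computed with `groupby`. The sketch below uses a toy frame with invented values. With the real data, the same call would be made on `housing_data`.

```python
import pandas as pd

# Toy data with invented values, standing in for the housing DataFrame
toy = pd.DataFrame({
    'median_house_value': [300, 500, 100, 200],
    'ocean_proximity': ['near', 'near', 'far', 'far'],
})

# Average house value within each ocean_proximity category
avg_value = toy.groupby('ocean_proximity')['median_house_value'].mean()
print(avg_value)
```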


In [None]:
housing_data.describe()

In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html

housing_data.plot.scatter(x = 'longitude', y = 'latitude')

In [None]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.hist.html

# select the column first; `by=` would instead group the rows by every distinct value
housing_data['median_house_value'].plot.hist(bins=10)

In [None]:
housing_data['median_house_value'].plot(kind='hist',bins=100)

In [None]:
housing_data['ocean_proximity'].count()

In [None]:
housing_data[['median_house_value','ocean_proximity']].\
groupby('ocean_proximity').count().plot(kind='bar')
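The group-and-count step can be checked on a toy frame (invented values). `Series.value_counts` is shown as a shorter alternative that is not used in these notes. It gives the same counts here. `count()` skips missing values in the counted column, whereas `value_counts` counts the category labels themselves.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    'median_house_value': [100, 200, 300, 400],
    'ocean_proximity': ['near', 'far', 'near', 'near'],
})

# Same pattern as above: rows per category, as a one-column DataFrame
by_group = toy[['median_house_value', 'ocean_proximity']].groupby('ocean_proximity').count()
print(by_group)

# Alternative: count the labels directly, as a Series
by_value = toy['ocean_proximity'].value_counts()
print(by_value)
```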