***
**Author:** Josiah Wallis \
Created for use in CS/STAT108: Data Science Ethics (UCR - Winter 2024)
***

# Python and Data Science Libraries
This notebook covers important built-in Python functionality used often in data science as well as fundamental data science libraries. It will **not** be graded.

***
# Lambda Functions
**Lambda functions** are "anonymous," one-line functions. Because they are so brief to define, they are commonly passed as arguments to other functions, as we'll see later. Syntax: \
```
var = lambda param: expression  # any single expression involving param
var(x)
```
Lambda functions can be assigned to variables, as shown above, and the variable can then be called like a function. They can take multiple parameters or none at all. The body must be a single expression, and the lambda returns the value of that expression automatically, without the `return` keyword.

In [3]:
hello = lambda name: print(f'Hello {name}!')
hello('Josiah')

Hello Josiah!


In [4]:
add = lambda x, y: x + y
# named total rather than sum to avoid shadowing the built-in sum()
total = add(3, 4)
total

7

In [5]:
hello_world = lambda: print('Hello World')
hello_world()

Hello World
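
To make the return behavior described above concrete, here is a minimal sketch comparing a lambda with its `def` equivalent. The names `square`, `square_def`, and `shout` are made up for illustration.

```python
# A lambda is an expression that evaluates to a function object
square = lambda n: n * n

# The equivalent named function written with def
def square_def(n):
    return n * n

print(square(4), square_def(4))  # 16 16

# A lambda returns whatever its expression evaluates to.
# print() itself returns None, so a lambda that only prints returns None.
shout = lambda text: print(text.upper())
result = shout('hello')
print(result)  # None
```

This is why `hello` and `hello_world` above are useful only for their printed output, while `add` produces a value that can be stored.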


***
# Functional Programming
Many functions take other functions as arguments! We'll see a few of these when working with the **numpy** and **pandas** libraries for data manipulation, but Python also has built-in functions and methods that follow this pattern, such as `sort`, `map`, and `filter`. Let's see some examples below.

In [6]:
words = ['test', 'dog', 'hi', 'cheesecake', 'crabby']
words.sort(key = len)
words

['hi', 'dog', 'test', 'crabby', 'cheesecake']

In [7]:
words_titled = map(str.title, words)
for word in words_titled:
  print(word)

Hi
Dog
Test
Crabby
Cheesecake


In [8]:
words_reversed = map(lambda x: x[::-1], words)
for word in words_reversed:
  print(word)

ih
god
tset
ybbarc
ekaceseehc


In [9]:
words_len_filter = filter(lambda x: len(x) <= 4, words)
for word in words_len_filter:
  print(word)

hi
dog
test


In [10]:
words_starts_filter = filter(lambda x: x.startswith('c'), words)
for word in words_starts_filter:
  print(word)

crabby
cheesecake
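
One detail the loops above hide is that `map` and `filter` return lazy iterators rather than lists, so each result can be consumed only once. The sketch below shows this and, for comparison, list comprehensions that do the same work. The variable names are illustrative.

```python
words = ['hi', 'dog', 'test', 'crabby', 'cheesecake']

# map() returns a lazy iterator, not a list
lengths = map(len, words)
first_pass = list(lengths)   # consumes the iterator
second_pass = list(lengths)  # nothing is left to consume

print(first_pass)   # [2, 3, 4, 6, 10]
print(second_pass)  # []

# List comprehensions that do the same work as map() and filter()
titled = [w.title() for w in words]
short = [w for w in words if len(w) <= 4]
print(titled)  # ['Hi', 'Dog', 'Test', 'Crabby', 'Cheesecake']
print(short)   # ['hi', 'dog', 'test']
```

Wrapping the result in `list(...)` is the usual way to keep the values around for reuse.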


***
# Numpy and Pandas
Two essential data science libraries for manipulating data are `numpy` and `pandas`. `numpy` provides the foundational tools and data structures for working with vectorized data, while `pandas` builds on `numpy` with a higher-level dataframe structure for manipulating and viewing tabular data in a "friendlier" form.

## Numpy

### ndarray

In [11]:
# Importing libraries
import numpy as np

In [12]:
x = np.array([1, 2, 3, 4])
x

array([1, 2, 3, 4])

In [13]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [14]:
print(f'ndarray data type: {x.dtype}\nndarray dimensions: {x.ndim}\nndarray shape: {x.shape}')

ndarray data type: int32
ndarray dimensions: 2
ndarray shape: (3, 3)


***
### Basic Data Generation

In [15]:
x = np.arange(25)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [16]:
x = np.zeros(10)
y = np.zeros((2, 5))
x

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [17]:
y

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

***
### Vectorization
ndarrays are not like normal lists. Suppose you have a 2D list of numbers and you want to perform arithmetic on all the values in the list. Normally, you'd have to loop through the values and operate on each element. **Vectorization** lets you apply the operation to the ndarray itself, and numpy performs it element-wise.

In [18]:
x = np.arange(25).reshape(5, 5)
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [19]:
x + 15

array([[15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39]])

In [20]:
x > 15

array([[False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])
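
For contrast with the cells above, here is a small sketch of the same kind of arithmetic done both ways: an explicit loop over a nested list and a single vectorized expression. The names `nested`, `looped`, `arr`, and `vectorized` are illustrative.

```python
import numpy as np

nested = [[0, 1, 2], [3, 4, 5]]

# Plain Python lists: visit every element explicitly
looped = [[value + 15 for value in row] for row in nested]

# ndarray: the same arithmetic applied to the whole array at once
arr = np.array(nested)
vectorized = arr + 15

print(looped)      # [[15, 16, 17], [18, 19, 20]]
print(vectorized)
print(np.array_equal(vectorized, np.array(looped)))  # True
```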

***
### Basic Indexing, Slicing, and Broadcasting

In [21]:
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [22]:
x[0, 2]

2

In [23]:
x[:3, 3]

array([ 3,  8, 13])

In [24]:
x[:3, 1:]

array([[ 1,  2,  3,  4],
       [ 6,  7,  8,  9],
       [11, 12, 13, 14]])

In [25]:
x[:, 2:]

array([[ 2,  3,  4],
       [ 7,  8,  9],
       [12, 13, 14],
       [17, 18, 19],
       [22, 23, 24]])

In [26]:
x[2:, :]

array([[10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

Broadcasting: assigning a single value to a slice fills every element of that slice.

In [27]:
x[2:, :] = 10
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10]])
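
The cell above broadcasts a scalar. Broadcasting also applies between arrays of different but compatible shapes, which the following sketch illustrates. The arrays `grid`, `row`, and `col` are made up for this example.

```python
import numpy as np

grid = np.zeros((3, 4), dtype=int)
row = np.array([1, 2, 3, 4])        # shape (4,)
col = np.array([[10], [20], [30]])  # shape (3, 1)

# The row is repeated down the rows, the column across the columns
print(grid + row)
print(grid + col)

# Shapes (4,) and (3, 1) broadcast together to (3, 4)
combined = row + col
print(combined.shape)  # (3, 4)
print(combined)
```

Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or one of them is 1.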

Slices are not independent copies. They are "views," or "windows," into the original ndarray: when you change the slice, you change the original.

In [28]:
row_slice = x[2]
row_slice

array([10, 10, 10, 10, 10])

In [29]:
row_slice[1:4] = 5
row_slice

array([10,  5,  5,  5, 10])

In [30]:
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10,  5,  5,  5, 10],
       [10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10]])

ndarray.copy(): to get an independent array instead of a view, make an explicit copy.

In [31]:
row_slice_cpy = row_slice.copy()
row_slice_cpy

array([10,  5,  5,  5, 10])

In [32]:
row_slice_cpy[:] = 50
row_slice_cpy

array([50, 50, 50, 50, 50])

In [33]:
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10,  5,  5,  5, 10],
       [10, 10, 10, 10, 10],
       [10, 10, 10, 10, 10]])
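
If it is unclear whether two arrays share data, `np.shares_memory` can check directly. The sketch below repeats the view-versus-copy experiment on a small made-up array `a`.

```python
import numpy as np

a = np.arange(10)
view = a[2:5]                # basic slice: a view into a
independent = a[2:5].copy()  # explicit copy: separate data

print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False

view[:] = -1         # writes through to a
independent[:] = 99  # leaves a untouched
print(a)  # [ 0  1 -1 -1 -1  5  6  7  8  9]
```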

***
### Boolean Indexing

In [34]:
x = np.arange(25).reshape(5, 5)
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [35]:
x < 15

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [False, False, False, False, False],
       [False, False, False, False, False]])

In [36]:
x[x < 15]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [37]:
x[(x % 2) == 0]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24])

In [38]:
x[~((x % 2) == 0)]

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23])

In [39]:
x[((x % 2) == 0) & (x > 15)]

array([16, 18, 20, 22, 24])
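
Unlike basic slices, selecting with a boolean mask returns a copy, while assigning through a mask modifies the original array. Here is a brief sketch on a made-up array `values`.

```python
import numpy as np

values = np.arange(10)
# Each comparison needs its own parentheses because & binds tighter than > and ==
mask = ((values % 2) == 0) & (values > 3)

selected = values[mask]  # boolean indexing returns a copy
selected[:] = 0          # does not affect values
print(values)  # [0 1 2 3 4 5 6 7 8 9]

values[mask] = -1        # assigning through the mask changes values in place
print(values)  # [ 0  1  2  3 -1  5 -1  7 -1  9]
```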

***
Numpy offers a large suite of functions and methods. To explore more of its functionality, such as matrix operations and random number generation, see the [API reference](https://numpy.org/doc/stable/reference/)!
***

## Pandas

### Dataframes

In [40]:
import pandas as pd

In [41]:
# Loading a dataset from a CSV file (data.csv must be in the working directory)
data = pd.read_csv('data.csv')
data.columns

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
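
The `FileNotFoundError` above means `data.csv` was not in the working directory when this notebook ran. To try the remaining cells without the file, one option is a small stand-in dataframe like the sketch below. Only the column names and the city `'Abidjan'` come from this notebook; every other value, including `'Riverside'`, is a made-up placeholder and not the real dataset.

```python
import pandas as pd

# Made-up placeholder rows; only the column names come from the notebook
data = pd.DataFrame({
    'dt': ['2000-01-01', '2000-02-01', '2000-01-01', '2000-02-01'],
    'AverageTemperature': [26.5, 27.1, 12.3, 14.0],
    'City': ['Abidjan', 'Abidjan', 'Riverside', 'Riverside'],
})

print(data.columns.tolist())  # ['dt', 'AverageTemperature', 'City']
print(data.shape)             # (4, 3)
```

Results computed from this stand-in say nothing about the actual data; it exists only so the syntax in the following cells can be run.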

In [None]:
shape = data.shape
print(f'{shape[0]} datapoints with {shape[1]} features')

In [None]:
# views first 5 datapoints (default)
data.head()

In [None]:
# single feature
data['dt']

In [None]:
# multiple features
data[['dt', 'AverageTemperature']]

In [None]:
# accessing a single datapoint by its index label
data.loc[0]

In [None]:
# accessing multiple datapoints (loc slices include both endpoints: labels 1 through 5)
data.loc[1:5]

***
### Boolean Indexing

In [None]:
data[data['AverageTemperature'] > 25]

In [None]:
data[data['City'] != 'Abidjan']
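
As with numpy, conditions can be combined with `&`, `|`, and `~`. The sketch below uses a tiny made-up dataframe `temps` (its values are placeholders, not the real dataset) so that it runs on its own.

```python
import pandas as pd

# Placeholder values for illustration only
temps = pd.DataFrame({
    'City': ['Abidjan', 'Abidjan', 'Riverside', 'Riverside'],
    'AverageTemperature': [26.5, 24.0, 12.3, 30.1],
})

hot = temps['AverageTemperature'] > 25
not_abidjan = temps['City'] != 'Abidjan'

print(temps[hot & not_abidjan])  # both conditions hold: row 3
print(temps[hot | not_abidjan])  # either condition holds: rows 0, 2, 3
print(temps[~hot])               # negation: rows 1, 2
```

Storing each condition in a named variable keeps the combined filter readable and avoids precedence mistakes.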

***
### Descriptive Statistics and Useful Functions/Methods

mean()

In [None]:
data['AverageTemperature'].mean()

std()

In [None]:
data['AverageTemperature'].std()

var()

In [None]:
data['AverageTemperature'].var()
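
One thing to watch for when mixing the two libraries: pandas `std()` and `var()` default to the sample statistic (`ddof=1`), while numpy's `np.std` and `np.var` default to the population statistic (`ddof=0`). The sketch below shows the difference on a made-up series `readings`.

```python
import numpy as np
import pandas as pd

readings = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = readings.std()                   # pandas default: ddof=1
population_std = np.std(readings.to_numpy())  # numpy default: ddof=0

print(sample_std)      # about 2.138
print(population_std)  # 2.0

# Passing ddof explicitly makes the two agree
print(readings.std(ddof=0))
```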

describe()

In [None]:
# summary statistics for numeric columns. count -> number of non-NA rows
data.describe()

corr()

In [None]:
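# pairwise correlation between numeric columns
# numeric_only=True skips text columns such as 'City', which raise an error in pandas 2.0+
data.corr(numeric_only=True)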

np.unique()

In [None]:
np.unique(data['City'])
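
pandas also has its own tools for distinct values, which behave slightly differently from `np.unique`. Here is a short sketch on a made-up series `cities` (the city names are placeholders).

```python
import numpy as np
import pandas as pd

cities = pd.Series(['Riverside', 'Abidjan', 'Riverside', 'Cairo', 'Abidjan'])

print(np.unique(cities))      # unique values, sorted
print(cities.unique())        # unique values, in order of first appearance
print(cities.nunique())       # number of distinct values: 3
print(cities.value_counts())  # how often each value occurs
```

`value_counts()` is often the more informative choice when exploring a categorical column, since it shows frequencies as well as the distinct values.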

***
Pandas is just as in-depth as numpy. Check out this [user guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) to see what else it has to offer!
***