# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. It was adopted for data science more recently than R or SAS, but it is now used for analysis just as often as either. A good foundation in Python and R (and SAS or SPSS) should be a *must* for **every data scientist** and machine learning enthusiast.

In this course, Python gives us an open-source way of performing machine learning that runs on just about any machine. So let's start by looking at the NumPy and Pandas packages for analyzing data.

With that in mind, let's go over the following:
- Numpy matrices
- Simple operations on arrays and matrices
- Indexing with numpy
- Pandas for tabular data
- Representing categorical data (discussion point)

In [1]:
import sys
import numpy as np

print(sys.version)
print(np.__version__)

3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
1.20.3


In [2]:
x = np.random.rand(5,3)
x

array([[0.87863441, 0.78331119, 0.6186509 ],
       [0.59426703, 0.6016146 , 0.80505423],
       [0.49038633, 0.8889456 , 0.4963617 ],
       [0.18962321, 0.37103697, 0.56821692],
       [0.14830413, 0.29774338, 0.04841899]])

In [3]:
x.shape

(5, 3)

In [4]:
x.dtype

dtype('float64')

In [5]:
# will this work?
y = np.random.rand(3,4)
z = x*y
z

ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 

In [7]:
# np.dot performs matrix multiplication explicitly: (5,3) times (3,4) gives (5,4)
z = np.dot(x,y)
z

array([[1.11659734, 0.62599526, 0.66802419, 0.80348237],
       [1.10612416, 0.55816319, 0.49950583, 0.66027405],
       [0.93820819, 0.57681172, 0.44945996, 0.54467592],
       [0.68642334, 0.33663422, 0.2143774 , 0.31924087],
       [0.21150515, 0.15870856, 0.12992588, 0.1396975 ]])

In [8]:
# or we can use the overloaded matrix multiplication operator
z = x @ y
z

array([[1.11659734, 0.62599526, 0.66802419, 0.80348237],
       [1.10612416, 0.55816319, 0.49950583, 0.66027405],
       [0.93820819, 0.57681172, 0.44945996, 0.54467592],
       [0.68642334, 0.33663422, 0.2143774 , 0.31924087],
       [0.21150515, 0.15870856, 0.12992588, 0.1396975 ]])
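
Why did `x*y` fail while `x @ y` worked? The `*` operator is elementwise, so the two shapes have to be broadcast-compatible, while `@` only needs the inner dimensions to match. Here is a minimal sketch of both; the names `a`, `b`, and `row` are made up for illustration and are not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable
a = rng.random((5, 3))
b = rng.random((3, 4))
row = rng.random(3)             # shape (3,) broadcasts against (5, 3)

elementwise = a * row           # every row of a is scaled by row -> shape (5, 3)
product = a @ b                 # (5, 3) @ (3, 4) -> shape (5, 4)

print(elementwise.shape)
print(product.shape)
print(np.allclose(product, np.dot(a, b)))  # @ and np.dot agree for 2-D arrays
```

Trying `a * b` here would raise the same `ValueError` as above, because (5, 3) and (3, 4) cannot be broadcast together.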

# Indexing

In [9]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
for row in range(x1.shape[0]):
    print(x1[row,1])

2
5
8


In [16]:
print(x1[:,1])
print(x1[:,1]>3)
# boolean mask: keep only the rows whose second column is not 2
print(x1[ x1[:,1]!=2 ])

[2 5 8]
[False  True  True]
[[4 5 6]
 [7 8 9]]


In [17]:
x2 = np.array(range(10))
print(x2)
x2.shape

[0 1 2 3 4 5 6 7 8 9]


(10,)

In [18]:
idx = x2>5
print(idx)
print(x2[idx])

[False False False False False False  True  True  True  True]
[6 7 8 9]


In [19]:
x2[x2>5] # elements of x2 that are greater than 5

array([6, 7, 8, 9])
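
Boolean masks can also be combined before indexing. The sketch below assumes the same kind of 1-D array as `x2`; the names `middle`, `ends`, and `not_even` are made up for illustration.

```python
import numpy as np

x2 = np.arange(10)

# parentheses are required: & and | bind more tightly than the comparisons
middle = x2[(x2 > 2) & (x2 < 7)]    # both conditions must hold
ends = x2[(x2 < 2) | (x2 > 7)]      # either condition may hold
not_even = x2[~(x2 % 2 == 0)]       # ~ inverts a boolean mask

print(middle)
print(ends)
print(not_even)
```

Note that Python's `and`, `or`, and `not` do not work elementwise on arrays, which is why the `&`, `|`, and `~` operators are used instead.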

# Named columns
So what if we have a matrix of data where each row is an observation and each column holds the values of one feature?

In [20]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [34,0,   2],
                 [30,100, 5]])
data

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [21]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3]])

In [22]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

Unnamed: 0,temperature,time,day
0,64,2100,1
1,50,2200,4
2,48,2300,3
3,34,0,2
4,30,100,5


In [23]:
# can always access the backend numpy array with .to_numpy() (older code uses .values)
print(type(df.to_numpy()))
df.to_numpy()

<class 'numpy.ndarray'>


array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [24]:
df[df.time>1500]

Unnamed: 0,temperature,time,day
0,64,2100,1
1,50,2200,4
2,48,2300,3


In [25]:
# let's get a description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   temperature  5 non-null      int32
 1   time         5 non-null      int32
 2   day          5 non-null      int32
dtypes: int32(3)
memory usage: 188.0 bytes


In [26]:
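# assign through .loc; chained indexing like df.day[mask] = ... may write to a temporary copy
df['day'] = df['day'].astype(object)  # allow strings next to the remaining integer codes
df.loc[df.day == 1, 'day'] = 'Mon'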
df

Unnamed: 0,temperature,time,day
0,64,2100,Mon
1,50,2200,4
2,48,2300,3
3,34,0,2
4,30,100,5


In [27]:
# there is almost always a more efficient built-in pandas function
# assign the result back instead of using inplace=True on a column reached through df.day
df['day'] = df['day'].replace(to_replace=list(range(7)),
                              value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df

Unnamed: 0,temperature,time,day
0,64,2100,Mon
1,50,2200,Th
2,48,2300,Wed
3,34,0,Tues
4,30,100,Fri


In [28]:
# notice how the dtype of the column has changed to object (strings), not the pandas "category" dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   temperature  5 non-null      int32 
 1   time         5 non-null      int32 
 2   day          5 non-null      object
dtypes: int32(2), object(1)
memory usage: 208.0+ bytes


In [29]:
# one hot encoding example
pd.get_dummies(df.day)

Unnamed: 0,Fri,Mon,Th,Tues,Wed
0,0,1,0,0,0
1,0,0,1,0,0
2,0,0,0,0,1
3,0,0,0,1,0
4,1,0,0,0,0
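
As a discussion point on representing categorical data: the one-hot table above only has columns for the days that happen to appear in the data. One possible alternative, sketched here under the assumption that we know all seven labels up front, is to declare them with `pd.Categorical` before encoding. The names `day_names`, `days`, and `days_cat` are made up for illustration.

```python
import pandas as pd

day_names = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']
days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'])

# an ordered categorical remembers every allowed level, not just the observed ones
days_cat = pd.Series(pd.Categorical(days, categories=day_names, ordered=True))

one_hot = pd.get_dummies(days_cat)
print(days_cat.dtype)          # category
print(list(one_hot.columns))   # all seven labels, in the declared order
print(one_hot.shape)
```

The unobserved days ('Su' and 'Sat') show up as all-zero columns, which keeps the encoded width the same no matter which days a particular sample contains.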


# Some Pandas Syntax

In [30]:
# selecting columns from a pandas dataframe (attribute, key, and list-of-keys access)
print(df.day)
print(df['day'])
df[['day','temperature']]

0     Mon
1      Th
2     Wed
3    Tues
4     Fri
Name: day, dtype: object
0     Mon
1      Th
2     Wed
3    Tues
4     Fri
Name: day, dtype: object


Unnamed: 0,day,temperature
0,Mon,64
1,Th,50
2,Wed,48
3,Tues,34
4,Fri,30


In [31]:
print(df.day[2])
print(df.day[2:])

Wed
2     Wed
3    Tues
4     Fri
Name: day, dtype: object


In [32]:
# index location
df.iloc[3:]

Unnamed: 0,temperature,time,day
3,34,0,Tues
4,30,100,Fri


In [33]:
df.iloc[3:][['day','temperature']]

Unnamed: 0,day,temperature
3,Tues,34
4,Fri,30


In [34]:
df[['day','temperature']].iloc[3:]

Unnamed: 0,day,temperature
3,Tues,34
4,Fri,30
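
So far the index labels and the row positions have been identical, which hides the difference between `.iloc` (position) and `.loc` (label). A small sketch of where they diverge, using a made-up filtered frame called `cool`:

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri']})

# keep only the cooler rows; the original index labels (2, 3, 4) are preserved
cool = df[df.temperature < 50]

by_position = cool.iloc[0]     # first row by position, which carries label 2
by_label = cool.loc[3]         # the row whose index label is 3

# .loc slices include the end label; .iloc slices exclude the end position
print(by_position['day'], by_label['day'])
print(len(cool.loc[2:3]), len(cool.iloc[0:1]))
```

After filtering, `cool.iloc[0]` and `cool.loc[0]` are no longer the same thing: the first works, while the second raises a `KeyError` because label 0 was filtered out.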


In [35]:
# numeric_only=True skips the non-numeric 'day' column
df.mean(numeric_only=True)


temperature      45.2
time           1340.0
dtype: float64

In [36]:
# numeric_only=True skips the non-numeric 'day' column
df.std(numeric_only=True)


temperature      13.608821
time           1180.254210
dtype: float64

In [37]:
# ratio of mean to standard deviation for each numeric column
df.mean(numeric_only=True)/df.std(numeric_only=True)


temperature    3.321375
time           1.135349
dtype: float64
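
Because these reductions return a Series indexed by column name, they line up automatically with the dataframe's columns in arithmetic. One common use, sketched here as an illustration rather than something the cells above do, is standardizing each numeric column to zero mean and unit standard deviation. The names `numeric` and `z` are made up.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'time': [2100, 2200, 2300, 0, 100],
                   'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri']})

# keep only the numeric columns so the string column is left out of the arithmetic
numeric = df.select_dtypes(include='number')

# the column-wise mean and std broadcast across the rows by column name
z = (numeric - numeric.mean()) / numeric.std()   # pandas std defaults to ddof=1

print(z.round(3))
```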

In [38]:
df.time.unique()

array([2100, 2200, 2300,    0,  100])

# Pandas Block Manager
Let's take a look at some important points from the following post:
 - https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

This is the pandas BlockManager, which groups columns of the same dtype into shared blocks to make things fast:
<img src="https://uwekorn.com/images/pd-df-perception.002.png" width=200 height=200 />

In [39]:
df

Unnamed: 0,temperature,time,day
0,64,2100,Mon
1,50,2200,Th
2,48,2300,Wed
3,34,0,Tues
4,30,100,Fri


In [40]:
# _mgr is a private attribute (older code uses the deprecated alias _data)
print(df._mgr.nblocks)
df._mgr

2


BlockManager
Items: Index(['temperature', 'time', 'day'], dtype='object')
Axis 1: RangeIndex(start=0, stop=5, step=1)
NumericBlock: slice(0, 2, 1), 2 x 5, dtype: int32
ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object

## Advantages and disadvantages:
This can speed up operations because it can apply operations (like sums, etc.) along columns of the same dtype in a single pass over the data, with compiled code (NumPy's C routines) doing much of the heavy lifting.

But **it might be bad** when you are adding columns to the data, because it can trigger consolidation of the blocks, which means copying data in NumPy to create a new matrix. The slowdown also doesn't show up until a needed column is accessed (lazy data copying). Let's do an example from: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

**Block consolidation is triggered after 100 blocks of data are reached.**

In [41]:
# we will start with a 2 column dataframe
# one column is an int and the other a float
# because there are two datatypes this has two blocks
df_example = pd.DataFrame({
    'int64': np.arange(1024 * 1024, dtype=np.int64),
    'float64': np.arange(1024 * 1024, dtype=np.float64),
})
df_example

Unnamed: 0,int64,float64
0,0,0.0
1,1,1.0
2,2,2.0
3,3,3.0
4,4,4.0
...,...,...
1048571,1048571,1048571.0
1048572,1048572,1048572.0
1048573,1048573,1048573.0
1048574,1048574,1048574.0


In [42]:
%%time 

# but now let's start to add columns one by one
# to be fast, pandas adds each as a new block 
# so we will have 99 blocks (2+97 new ones)
for i in range(97):
    df_example[f'new_{i}'] = df_example['int64'].to_numpy()
    
print(df_example._mgr.nblocks)  # _mgr is private; _data is its deprecated alias
df_example

99
Wall time: 202 ms


Unnamed: 0,int64,float64,new_0,new_1,new_2,new_3,new_4,new_5,new_6,new_7,...,new_87,new_88,new_89,new_90,new_91,new_92,new_93,new_94,new_95,new_96
0,0,0.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1.0,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,2,2.0,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
3,3,3.0,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
4,4,4.0,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048571,1048571,1048571.0,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,...,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571
1048572,1048572,1048572.0,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,...,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572
1048573,1048573,1048573.0,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,...,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573
1048574,1048574,1048574.0,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,...,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574


In [43]:
%time df_example['dummy_name3'] = df_example['int64'].values # add one more column
print('Number of blocks in data:',df_example._mgr.nblocks)

%time df_example['dummy_name4'] = df_example['int64'].values # and another
print('Number of blocks in data:',df_example._mgr.nblocks)


Wall time: 1.99 ms
Number of blocks in data: 100
Wall time: 1.99 ms
Number of blocks in data: 101




In [44]:
df_example.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048576 entries, 0 to 1048575
Columns: 101 entries, int64 to dummy_name4
dtypes: float64(1), int64(100)
memory usage: 808.0 MB
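
If many columns need to be added, one way to sidestep the block-by-block growth is to build the new columns separately and attach them in a single step with `pd.concat(axis=1)`. This is a sketch under the assumption that all new columns are known up front; `n_rows`, `base`, `new_cols`, and `wide` are made-up names, and the row count is kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd

n_rows = 1024            # kept small here; the cells above use 1024 * 1024
base = pd.DataFrame({
    'int64': np.arange(n_rows, dtype=np.int64),
    'float64': np.arange(n_rows, dtype=np.float64),
})

# collect the new columns first, then attach them in a single concat
new_cols = pd.DataFrame({f'new_{i}': base['int64'].to_numpy() for i in range(97)})
wide = pd.concat([base, new_cols], axis=1)

print(wide.shape)
print(wide.dtypes.value_counts())
```

The result has the same 99 columns as the loop version, but it is assembled from two frames instead of 97 separate insertions. How many internal blocks this produces depends on the pandas version, so it is worth checking on your own install.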
