# `numpy` and `pandas` - Python data science libraries

**Author: Trevor Faske (updated by T. Parchman)  
Modified: 11/11/2023**

`pandas` is a must-learn tool for data science. It is a powerful Python package and a Swiss Army knife for data analysis. "The name is derived from the term 'panel data', an econometrics term for data sets that include observations over multiple time periods for the same individuals. Also a play on the phrase 'Python data analysis.'" - Wikipedia

`pandas` works with a data structure called the **DataFrame** (similar to the data frame in R). It consists of rows and columns and looks very similar to an Excel spreadsheet or csv file. `pandas` allows you to easily manipulate, filter, summarize, and merge data for downstream processing. `pandas` is part of the SciPy (https://www.scipy.org/) ecosystem, so it works well alongside other plotting and data analysis libraries.

`numpy` is a widely used Python library that supports working with large, multi-dimensional arrays and matrices. It has a wide range of mathematical functions that operate on these arrays or matrices. `pandas` is built on top of `numpy`, so it requires `numpy` and resembles much of its functionality.

## Useful Resources

- https://github.com/jvns/pandas-cookbook
- https://www.w3schools.com/python/ (great NumPy intro)
- https://pandas.pydata.org/docs/getting_started/tutorials.html (community tutorials)
- https://pandas.pydata.org/docs/user_guide/index.html
- https://www.kaggle.com/learn/pandas  
- https://blog.resolvingpython.com/01-getting-started-with-pandas

## Installing libraries

Python has substantial built-in data structures and functions, but external libraries (packages) offer an enormous diversity of workflows and functions. Most of the growing flexibility and toolkit associated with Python comes from these libraries.

Most of you will have `pip3` or `conda` available, and either can be used to install the needed libraries from the terminal.

**If using pip3**:  

`$ pip3 install numpy`   
`$ pip3 install pandas`
    
You might get a permissions error. If so, install like:  
`$ pip3 install --user pandas`

**If using conda**:  

`$ conda install -c anaconda numpy`  
`$ conda install -c anaconda pandas`
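
After installing, a quick check from Python can confirm that both libraries import correctly. This is a minimal sketch; the version numbers printed will depend on your own installation.

```python
# Confirm that both libraries import and report their installed versions.
import numpy as np
import pandas as pd

print("numpy version:", np.__version__)
print("pandas version:", pd.__version__)
```

If either import raises `ModuleNotFoundError`, the install most likely went to a different Python environment than the one running the notebook.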


## Getting started with NumPy

#### Resource: https://www.w3schools.com/python/numpy_intro.asp

`NumPy` is a popular array-processing package that also supports fast mathematical operations (e.g., matrix math). Everything is array/matrix based, and array operations are designed to run much faster than comparable tasks with base Python lists. `pandas` uses much of the same syntax, so it is useful, if not necessary, to know one for the other.
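
To make the contrast with base Python lists concrete, here is a small sketch that doubles the same values both ways. The variable names are only illustrative.

```python
import numpy as np

values = [1, 3, 5, 2, 4, 6]

# Base Python: build a new list by visiting each element in turn.
doubled_list = [v * 2 for v in values]

# NumPy: the multiplication is applied to every element of the array at once.
doubled_array = np.array(values) * 2

print(doubled_list)            # [2, 6, 10, 4, 8, 12]
print(doubled_array.tolist())  # [2, 6, 10, 4, 8, 12]
```

The results are identical; the array version avoids an explicit Python loop, which is where the speed advantage on large data comes from.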

### importing library


In [2]:
import numpy as np #np used as a shorter alias. I suggest always doing this

### ndarrays

#### Create 1-D array with `.array()` function
An array is much like a list. As in base Python, the values are passed as a list enclosed in `[]`.

`type` returns the type of an object. In the case below, it is a `numpy.ndarray`.

In [3]:
d1 = np.array([1,3,5,2,4,6])
print(d1)
type(d1)

[1 3 5 2 4 6]


numpy.ndarray

#### Create 2-D array with `.array()` function

This is essentially a list of lists: a 2-dimensional structure in which each inner list is a row and its values fall into columns.

In [4]:
d2 = np.array([[1,3,5],[2,4,6]])
print(d2)
type(d2)

[[1 3 5]
 [2 4 6]]


numpy.ndarray

#### Get dimensions and total size of any `ndarray`

In [3]:
print(d2.shape) #rows, columns
print(d2.size) 
d2.shape

(2, 3)
6


(2, 3)

### Accessing and indexing `numpy.ndarrays` is similar to lists but with added dimension. Very similar to R indexing of data frames.

As with lists, indexing is zero-based, so `0` is the first index in an array. For a 2-dimensional `numpy.ndarray`, rows are given with the first index and columns with the second: `array_name[row,column]`. This `[row,column]` layout is the same as for data frames in R (although R counts from 1), so if you know one, the other is quite easy.

In [4]:
print(d2)

# 2nd row, 1st column 
print(d2[1,0]) 

# 1st row, 3rd column
print(d2[0,2])

[[1 3 5]
 [2 4 6]]
2
5
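
Beyond single elements, the same `[row,column]` syntax can pull out whole rows or columns. The sketch below rebuilds `d2` so that it runs on its own; the other variable names are illustrative.

```python
import numpy as np

d2 = np.array([[1, 3, 5], [2, 4, 6]])

first_row = d2[0, :]     # all columns of the 1st row
second_col = d2[:, 1]    # all rows of the 2nd column
last_value = d2[-1, -1]  # negative indices count from the end

print(first_row.tolist())   # [1, 3, 5]
print(second_col.tolist())  # [3, 4]
print(last_value)           # 6
```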

#### Slicing works very similarly to lists in base python. 

This is similar to extracting part of a single row, or collection of rows, or part of a single column, or collection of columns.


In [5]:
#extract first 2 elements of the 2nd row
print(d2[1,:2])
print(d2)

[2 4]
[[1 3 5]
 [2 4 6]]
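
A few more slicing patterns, as a sketch. One detail worth knowing: a basic slice is a view onto the original array, so use `.copy()` when you want to modify the extracted piece without touching the original.

```python
import numpy as np

d2 = np.array([[1, 3, 5], [2, 4, 6]])

last_two_cols = d2[:, 1:]  # every row, columns 2 and 3
every_other = d2[0, ::2]   # 1st row, every second element

# Copy the slice before editing it so that d2 itself is left unchanged.
independent = d2[:, 1:].copy()
independent[0, 0] = 99

print(last_two_cols.tolist())  # [[3, 5], [4, 6]]
print(every_other.tolist())    # [1, 5]
print(d2[0, 1])                # 3
```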

### Reshape array format with `.reshape()`

Reshaping an array involves changing its dimensions. For example, we might want to turn a long 1-d array into a 2-d array. 

The code below takes a 1-d array and turns into a 2-d array with 2 rows, and 3 columns.

In [6]:
d1 = np.array([1,3,5,2,4,6]) 
print(d1)
d1.reshape(2,3) # 2 rows, 3 columns
d1.reshape(3,2) # 3 rows, 2 columns


[1 3 5 2 4 6]


array([[1, 3],
       [5, 2],
       [4, 6]])

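The code below takes a 2-d array with 2 rows and 3 columns and turns it into a single row of 6 values. Note that `reshape(1,6)` still returns a 2-d array (1 row, 6 columns), as the double brackets in the output show; for a true 1-d array, use `d2.reshape(6)` or `d2.reshape(-1)`.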

In [7]:
d2 = np.array([[1,3,5],[2,4,6]])
d2.reshape(1,6)  

array([[1, 3, 5, 2, 4, 6]])
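
The difference between a single-row 2-d array and a true 1-d array is easiest to see from `.shape`. A minimal sketch:

```python
import numpy as np

d2 = np.array([[1, 3, 5], [2, 4, 6]])

one_row = d2.reshape(1, 6)  # still 2-d: 1 row, 6 columns
flat = d2.reshape(-1)       # 1-d: -1 tells numpy to infer the length

print(one_row.shape)  # (1, 6)
print(flat.shape)     # (6,)
print(flat.tolist())  # [1, 3, 5, 2, 4, 6]
```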

### create 1D array containing a range of values using the `.arange()` function. 

This function simply returns "A Range" of values.

In [8]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#### array from 0 up to (but not including) 50 by 5 (start,stop,step)

In [9]:
np.arange(0,50,5)

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45])
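
Because the stop value is excluded, it has to be pushed past the last value you want in order to include it. A short sketch, which also shows a negative step:

```python
import numpy as np

up_to_fifty = np.arange(0, 51, 5)  # stop of 51 so that 50 is included
countdown = np.arange(10, 0, -2)   # negative step counts downward; 0 is excluded

print(up_to_fifty.tolist())  # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
print(countdown.tolist())    # [10, 8, 6, 4, 2]
```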

### Random number generation with the `random` module of `numpy`

Very useful for permutation techniques, simulating data, or anything else you might want random numbers for.

Import `random` from `numpy` as below:

In [10]:
from numpy import random

#### Generate a random float from 0 to 1 with `.rand()` function

In [11]:
random.rand()

0.1717424004934437

#### Generate 5 random floats from 0 to 1

In [12]:
random.rand(5)

array([0.42929298, 0.71318927, 0.36047879, 0.35215489, 0.83002852])

#### Generate one random integer between 0-99 with `.randint()` 

In [13]:
random.randint(100)

88

#### Generate 7 random integers between 0-99 

In [14]:
random.randint(100, size = 7)

array([29, 15, 59, 39, 60, 30, 10])

#### Generate 2D array of random integers between 0-99

In [15]:
random.randint(100, size=(2,3))

array([[93, 40, 54],
       [50, 21,  7]])

### Choose or randomly sample from an array

#### Take a random sample from an array

In [16]:
random.choice([3, 5, 7, 9])

9

#### Sample with replacement from list

The code below samples repeatedly, with replacement, from the list until an 2-d array with 3 rows and 4 columns is built.

In [17]:
random.choice([3, 5, 7, 9],size=(3,4))

array([[3, 3, 9, 9],
       [5, 7, 7, 7],
       [7, 9, 3, 9]])
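
The random outputs above will differ every time the cells are run. If you need reproducible results (for example, for a permutation test you want to repeat exactly), one option is NumPy's `Generator` interface, created with `np.random.default_rng`. This is a sketch of that alternative rather than what the cells above use; the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same stream of numbers.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

draws_a = rng_a.integers(0, 100, size=5)  # 5 integers from 0 to 99
draws_b = rng_b.integers(0, 100, size=5)
print(np.array_equal(draws_a, draws_b))   # True

# Sampling WITHOUT replacement: each value can be drawn at most once.
sample = rng_a.choice([3, 5, 7, 9], size=3, replace=False)
print(len(set(sample.tolist())))          # 3
```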

## Math in `numpy`

(https://numpy.org/doc/stable/reference/routines.math.html)

### Generate 100 random integers from 0 to 999 and get length, max, min, mean

In [18]:
x = random.randint(1000,size=100)

print(len(x))
print(x.max())
print(x.min())
print(x.mean())


100
996
3
516.51


### Mathematical operations can easily and rapidly be conducted on each element of an array. 

Below, we are making a 1-d array of ten random integers. We are then using `np.divide` to divide each element by 2, `np.power` to raise each element to the power of 2, and `np.sqrt` to take the square root of each element. 

In [19]:
x = random.randint(1000,size=10)
x2= x.reshape(2,5) #reshape into a 2d array with 2 rows and 5 columns, just for fun.
print(x2)
print(np.divide(x2, 2))

print(np.power(x2, 2))

print(np.sqrt(x2))

[[402 400 631 776 449]
 [720 296 746 841  98]]
[[201.  200.  315.5 388.  224.5]
 [360.  148.  373.  420.5  49. ]]
[[161604 160000 398161 602176 201601]
 [518400  87616 556516 707281   9604]]
[[20.04993766 20.         25.11971337 27.85677655 21.1896201 ]
 [26.83281573 17.20465053 27.31300057 29.          9.89949494]]
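
The same element-wise operations can also be written with ordinary arithmetic operators, and summaries can be taken along rows or columns with the `axis` argument. The sketch below reuses the values printed above so the results can be checked by hand.

```python
import numpy as np

x2 = np.array([[402, 400, 631, 776, 449],
               [720, 296, 746, 841, 98]])

# Operators are shorthand for the corresponding numpy functions.
print(np.array_equal(x2 / 2, np.divide(x2, 2)))  # True
print(np.array_equal(x2 ** 2, np.power(x2, 2)))  # True

col_totals = x2.sum(axis=0)  # one total per column (collapses the rows)
row_means = x2.mean(axis=1)  # one mean per row (collapses the columns)
print(col_totals)
print(row_means)
```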


## Iterating through numpy arrays

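Just as you can iterate through lists or other data structures in base Python, you can iterate through `numpy` ndarrays. This can mean iterating through a 1-d array or 2-d array, or through single 'columns' or 'rows' of 2-d arrays. Keep in mind that the speed advantage of `numpy` comes from its built-in array (vectorized) operations; explicit `for` loops over an array are generally no faster than loops over a list, so for large data sets prefer an array operation whenever one exists.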

### Iterating through a 1-d array

In [20]:
arr = np.array([1, 2, 3, 4, 100, 150])

for x in arr:
  print(x) 

1
2
3
4
100
150


### Iterating through a 2-d array

In [21]:
arr = np.array([[1, 2, 3], [4, 100, 150]])

for x in arr:
  print(x) 

[1 2 3]
[  4 100 150]


### Iterating through one 'column' of a 2-d array

`arr[0:,2]` below specifies first, all rows, and second, the third column

In [22]:
arr = np.array([[1, 2, 3], [4, 100, 150]])

for x in arr[0:,2]:
  print(x) 

3
150
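
For comparison, here is a sketch that computes the same totals with explicit loops and with built-in array methods. Both give the same answer; the array methods are usually the better choice on large data.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 100, 150]])

# Explicit loops: visit each row, then each value within the row.
loop_total = 0
for row in arr:
    for value in row:
        loop_total += value

# Array methods: the same sums without writing a loop.
array_total = arr.sum()
third_col_total = arr[:, 2].sum()

print(loop_total)       # 260
print(array_total)      # 260
print(third_col_total)  # 153
```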


## Building `numpy` ndarrays from files, writing ndarrays to files

Using `numpy`, we can quickly read 1-dimensional or 2-dimensional data from text files into ndarray objects. This is very similar to using `read.csv`, `read.delim`, or `read.table` in `R` for those of you with `R` experience.

Let's start by reading a .csv text file into a 2-d array object. We will use a simple file with latitude and longitude for 15 stands of *Pinus muricata* from the coast of California; the file resides in the same directory as this notebook.

`np.loadtxt` is used to read delimited text that does not have missing values. To control for missing values, `np.genfromtxt` can be used to load data from a text file, with missing values handled as defined. `np.loadtxt` works the same as `np.genfromtxt` when there is no missing data.

In [16]:
x2d = np.loadtxt("muricata_pops_lat_long.csv", delimiter=",")
print(x2d)

[[  34.003334   -119.614283  ]
 [  34.013676   -119.797136  ]
 [  35.244959   -120.879209  ]
 [  36.5935446  -121.9257046 ]
 [  38.519014   -123.246679  ]
 [  34.003334   -119.614283  ]
 [  39.19511111 -123.765     ]
 [  38.87558333 -123.6632778 ]
 [  34.024463   -119.692431  ]
 [  41.139904   -124.153496  ]
 [  38.062777   -122.848611  ]
 [  34.013676   -119.797136  ]
 [  38.576889   -123.33385   ]
 [  38.72941    -123.472548  ]]
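
To illustrate how `np.genfromtxt` handles missing values, the sketch below reads a small in-memory table in which one longitude is missing. The text is a made-up three-line excerpt (not the actual file), wrapped in `io.StringIO` so that no file on disk is needed.

```python
import io
import numpy as np

# Hypothetical excerpt: the second row is missing its longitude.
text = "34.003334,-119.614283\n34.013676,\n35.244959,-120.879209\n"

coords = np.genfromtxt(io.StringIO(text), delimiter=",")
print(coords.shape)            # (3, 2)
print(np.isnan(coords[1, 1]))  # True: the empty field was read as nan

# Keep only rows with no missing values.
complete = coords[~np.isnan(coords).any(axis=1)]
print(complete.shape)          # (2, 2)
```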


In [17]:
for i in x2d[0:,0]:
    print("Latitude is: ", i)

Latitude is:  34.003334
Latitude is:  34.013676
Latitude is:  35.244959
Latitude is:  36.5935446
Latitude is:  38.519014
Latitude is:  34.003334
Latitude is:  39.19511111
Latitude is:  38.87558333
Latitude is:  34.024463
Latitude is:  41.139904
Latitude is:  38.062777
Latitude is:  34.013676
Latitude is:  38.576889
Latitude is:  38.72941


### Writing ndarray to file:

In [18]:
x = x2d[0:,0] # making 1-d array from twodarray, this will contain just latitudes
print(x)
np.savetxt("np_savetxt_test.txt", x)

[34.003334   34.013676   35.244959   36.5935446  38.519014   34.003334
 39.19511111 38.87558333 34.024463   41.139904   38.062777   34.013676
 38.576889   38.72941   ]
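
A quick way to check a write is to read the file straight back in. The sketch below does this round trip in a temporary directory, so it leaves nothing behind; the file name and the `fmt` setting are only illustrative.

```python
import os
import tempfile
import numpy as np

latitudes = np.array([34.003334, 34.013676, 35.244959])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "latitudes.txt")
    np.savetxt(out_path, latitudes, fmt="%.6f")  # write 6 decimal places per value
    reloaded = np.loadtxt(out_path)              # read the file back in

print(np.allclose(latitudes, reloaded))  # True
```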


In [5]:
#from command line, to verify that file was created
!ls

Bloom_etal_2018_Reduced_Dataset.csv     python12_numpy_pandas.pptx
assignment_6_python.ipynb               pythonModules.pptx
logfiles.tgz                            state_plot04-01.png
muricata_pops_lat_long.csv              states_covid.csv
np_savetxt_test.txt                     ~$python12_pandas.pptx
primer_python7_numpy_pandas_partI.ipynb ~$pythonModules.pptx


## Getting started with pandas

### side note on linux commands: 

`jupyter` notebooks can run linux commands as in the terminal, with some adjustments for certain commands. `!` at the start of a code block will allow many unix commands, `%` will allow others, and other adjustments are illustrated below

For the work below, make sure this jupyter notebook is saved in the same directory as the **states_covid.csv** file that you can find in the same github directory. You need your path to be correct to load files.

The code below sets a working directory. This is only necessary if the notebook is in a different directory from the one you want to read from or write to.


In [20]:
### Change to pandas working directory. This is only necessary if notebook is in different directory from where you want to read from or write to
pandas_dir = '/Users/thomasparchman/Desktop/files/courses/data_scienceI/python_work/week12_python7_pandas'

In [21]:
cd $pandas_dir

/Users/thomasparchman/Dropbox/Mac/Desktop/files/courses/data_scienceI/python_work/week12_python7_pandas


Code below illustrates use of some linux commands from within `jupyter`

In [39]:
!mkdir new_dir
!ls

Bloom_etal_2018_Reduced_Dataset.csv pythonModules.pptx
assignment_6_python.ipynb           state_plot04-01.png
logfiles.tgz                        states_covid.csv
[1m[36mnew_dir[m[m                             ~$python12_pandas.pptx
primer_python6_pandas.ipynb         ~$pythonModules.pptx
python12_pandas.pptx


In [40]:
!rmdir new_dir
!ls

Bloom_etal_2018_Reduced_Dataset.csv pythonModules.pptx
assignment_6_python.ipynb           state_plot04-01.png
logfiles.tgz                        states_covid.csv
primer_python6_pandas.ipynb         ~$python12_pandas.pptx
python12_pandas.pptx                ~$pythonModules.pptx


### Covid data for the demo contained below

Data downloaded from: https://github.com/COVID19Tracking/covid-tracking-data   
(data stopped updating March, 2021) 

### Uses of numpy and pandas with a simple data science example 

Scraping covid data, summarizing by state, plotting

![covid_fig_04-1](state_plot04-01.png)

### Read and write files (using DataFrame) 

Make sure you have **states_covid.csv** in your pandas directory from above.   


In [41]:
import pandas as pd

state_covid_df = pd.read_csv('states_covid.csv') #read in csv
state_covid_df.head() #views the top 5 lines

Unnamed: 0,date,state,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,hospitalizedIncrease,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2021-02-23,AK,290.0,,0,,1260.0,1260.0,38.0,9,...,1653425.0,4640,,,,,,0,1653425.0,4640
1,2021-02-23,AL,9660.0,7575.0,68,2085.0,45250.0,45250.0,762.0,122,...,2265086.0,4825,,,115256.0,,2265086.0,4825,,0
2,2021-02-23,AR,5377.0,4321.0,14,1056.0,14617.0,14617.0,545.0,47,...,2609837.0,4779,,,,436309.0,,0,2609837.0,4779
3,2021-02-23,AS,0.0,,0,,,,,0,...,2140.0,0,,,,,,0,2140.0,0
4,2021-02-23,AZ,15650.0,13821.0,148,1829.0,57072.0,57072.0,1515.0,78,...,7478323.0,19439,435091.0,,,,3709365.0,6212,7478323.0,19439


In [52]:
state_covid_df['death']

0          290.0
1         9660.0
2         5377.0
3            0.0
4        15650.0
          ...   
20103        NaN
20104        NaN
20105        NaN
20106        NaN
20107        NaN
Name: death, Length: 20108, dtype: float64

In [53]:
state_covid_df.shape #row, column length

(20108, 41)
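
Selecting columns and filtering rows follow the same bracket pattern. The sketch below uses a small hand-built DataFrame (values copied from the first five rows shown above) so that it runs without the csv file; the 5000 threshold is arbitrary.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "state": ["AK", "AL", "AR", "AS", "AZ"],
    "death": [290.0, 9660.0, 5377.0, 0.0, 15650.0],
    "positive": [55560.0, 488973.0, 316593.0, 0.0, 810658.0],
})

# A list of column names inside the brackets returns a smaller DataFrame.
subset = demo_df[["state", "death"]]

# A boolean condition inside the brackets keeps only the rows where it is True.
high_deaths = demo_df[demo_df["death"] > 5000]

print(subset.shape)                   # (5, 2)
print(high_deaths["state"].tolist())  # ['AL', 'AR', 'AZ']
```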

In [54]:
state_covid_df.columns #views the column names

Index(['date', 'state', 'death', 'deathConfirmed', 'deathIncrease',
       'deathProbable', 'hospitalized', 'hospitalizedCumulative',
       'hospitalizedCurrently', 'hospitalizedIncrease', 'inIcuCumulative',
       'inIcuCurrently', 'negative', 'negativeIncrease',
       'negativeTestsAntibody', 'negativeTestsPeopleAntibody',
       'negativeTestsViral', 'onVentilatorCumulative', 'onVentilatorCurrently',
       'positive', 'positiveCasesViral', 'positiveIncrease', 'positiveScore',
       'positiveTestsAntibody', 'positiveTestsAntigen',
       'positiveTestsPeopleAntibody', 'positiveTestsPeopleAntigen',
       'positiveTestsViral', 'recovered', 'totalTestEncountersViral',
       'totalTestEncountersViralIncrease', 'totalTestResults',
       'totalTestResultsIncrease', 'totalTestsAntibody', 'totalTestsAntigen',
       'totalTestsPeopleAntibody', 'totalTestsPeopleAntigen',
       'totalTestsPeopleViral', 'totalTestsPeopleViralIncrease',
       'totalTestsViral', 'totalTestsViralIncrease'],
      dtype='object')

While the above example is very straightforward with a clean csv file, **pd.read_csv()** is a very powerful tool for reading/parsing complicated data. For more information on all of its parameters, visit: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. Otherwise, Google is your best friend. Any issue you have, someone has probably figured it out already.

Dates are a common source of trouble in all data formats. Pandas can read dates in without much headache and has nice features for working with them. You can also select only certain columns, rename headers, remove headers, change which characters are recognized as NAs, etc.

Below is an example of some of the things you can do. 

In [58]:
state_covid_sub_df = pd.read_csv('states_covid.csv',
                                 usecols=['date','state','death','positive','negative','totalTestResults'], #keep only these columns
                                 parse_dates=['date']) #parse 'date' as datetime; infer_datetime_format is deprecated as of pandas 2.0 and no longer needed
state_covid_sub_df.head()

Unnamed: 0,date,state,death,negative,positive,totalTestResults
0,2021-02-23,AK,290.0,,55560.0,1653425.0
1,2021-02-23,AL,9660.0,1882180.0,488973.0,2265086.0
2,2021-02-23,AR,5377.0,2359571.0,316593.0,2609837.0
3,2021-02-23,AS,0.0,2140.0,0.0,2140.0
4,2021-02-23,AZ,15650.0,2953210.0,810658.0,7478323.0
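
Two of the other options mentioned above, custom NA markers and renamed headers, can be sketched as follows. The csv text here is made up for illustration (including the `missing` marker and the new column name `deaths`) and is read from memory with `io.StringIO` rather than from a file.

```python
import io
import pandas as pd

# Hypothetical csv content in which the word "missing" marks an absent value.
csv_text = "date,state,death\n2021-02-23,AK,290\n2021-02-23,AS,missing\n"

df = pd.read_csv(io.StringIO(csv_text),
                 na_values=["missing"],  # treat this string as NA
                 parse_dates=["date"])   # read 'date' as datetime

df = df.rename(columns={"death": "deaths"})  # rename a header after reading

print(df["deaths"].isna().sum())  # 1
print(df.columns.tolist())        # ['date', 'state', 'deaths']
```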


#### check and make sure dtypes are right (dates specifically)

In [59]:
state_covid_sub_df.dtypes

date                datetime64[ns]
state                       object
death                      float64
negative                   float64
positive                   float64
totalTestResults           float64
dtype: object
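
Once a column has a datetime dtype, its parts are available through the `.dt` accessor, and rows can be filtered by date. A small sketch with made-up rows:

```python
import pandas as pd

dates_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-02-23", "2021-01-15", "2020-12-31"]),
    "state": ["AK", "AK", "AK"],
})

# Pull out parts of each date with the .dt accessor.
dates_df["year"] = dates_df["date"].dt.year
dates_df["month"] = dates_df["date"].dt.month

# Keep only rows on or after 1 January 2021.
recent = dates_df[dates_df["date"] >= pd.Timestamp("2021-01-01")]

print(dates_df["year"].tolist())   # [2021, 2021, 2020]
print(dates_df["month"].tolist())  # [2, 1, 12]
print(len(recent))                 # 2
```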

### Write DataFrame to outfile 
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

**note:** make sure you provide the path or are in the working directory you want

In [60]:
import os

# pandas_dir has no trailing slash, so build the path with os.path.join rather than string concatenation
outfile_path = os.path.join(pandas_dir, 'state_covid_sub.csv')
#print(outfile_path)
state_covid_sub_df.to_csv(path_or_buf=outfile_path, index=False)
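
The `index=False` argument matters when the file is read back in. The sketch below writes a tiny made-up DataFrame to csv text both ways (calling `to_csv` without a path returns the text as a string) and compares the columns that come back.

```python
import io
import pandas as pd

demo_df = pd.DataFrame({"state": ["AK", "AL"], "death": [290.0, 9660.0]})

# Default: the row index is written as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(demo_df.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'state', 'death']
print(without_index.columns.tolist())  # ['state', 'death']
```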

In [61]:
!ls

Bloom_etal_2018_Reduced_Dataset.csv pythonModules.pptx
assignment_6_python.ipynb           state_covid_sub.csv
logfiles.tgz                        state_plot04-01.png
primer_python6_pandas.ipynb         states_covid.csv


In [63]:
!find . -name '*csv'

./state_covid_sub.csv
./Bloom_etal_2018_Reduced_Dataset.csv
./states_covid.csv
