## Pandas & NumPy

## 1. Pandas - Panel Data

`Pandas` is a Python library for analyzing data. It provides classes, methods, and functions to `read`, `manipulate`, and `analyze` tabular data.
* Tabular or relational data is organized into rows and columns
* Rows contain individual elements
* Columns contain properties of each element

To use pandas in your code, use:
```python
import pandas as pd
```


### <font color='green'>Getting Data into `Python` using `Pandas`</font>

Pandas has two main data types or collection objects: `DataFrames` and `Series`

`DataFrames` are like Excel tables (and like the data frames found in R). They consist of `columns` (each of which is a series), `rows`, and `indices` (names attached to rows).
![pic2.png](attachment:pic2.png)

`Series` are one-dimensional labeled arrays, similar to NumPy arrays (which we will cover shortly). A series consists of a `name`, its `values`, and an `index`. A plain NumPy array has no label index, but the values and the index of a series are themselves backed by NumPy arrays.
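As a minimal sketch of the two objects described above (the names `prices` and `inventory` and their values are illustrative assumptions, not course data), the following shows the parts of a series and how a dataframe column is itself a series:

```python
import pandas as pd

# A Series: one-dimensional values, an index of labels, and an optional name
prices = pd.Series([10, 2, 5], index=['book', 'pen', 'ruler'], name='price')

# A DataFrame: each column is itself a Series sharing the same row index
inventory = pd.DataFrame({'price': prices, 'in_stock': [3, 40, 12]})

print(prices.name)                 # name of the Series
print(list(prices.index))          # row labels
print(prices.to_numpy())           # underlying values as a NumPy array
print(type(inventory['price']))    # selecting one column returns a Series
```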

### Creating Series and DataFrames
1. Lists
    * lists can turn into dataframes
```python
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks'] 
df = pd.DataFrame(lst, index =['a', 'b', 'c', 'd', 'e', 'f', 'g'], columns =['Names'])
```

2. Dictionaries
    * dictionaries can turn into dataframes

```python
data = {'team':['Leicester', 'Manchester City', 'Arsenal'], 
        'player':['Vardy', 'Aguero', 'Sanchez'], 
        'goals':[24,22,19]
       }

my_df = pd.DataFrame(data)
my_df

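#or, choosing the column order and the row labels
#('played' is not a key in data, so that column is filled with NaN)
football = pd.DataFrame(data, columns=['player','team','goals','played'], 
                     index=['one','two','three'])

#or from a list of dictionaries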
df = pd.DataFrame([{'Item':'Book', 'Cost':10},{'Item':'Pen','Cost':2}])

```

3. Series
    * Each series represents a column
```python
items = pd.Series(['Book','Pen'])
costs = pd.Series([10,2])
df = pd.DataFrame({'Item':items,'Cost':costs})
```

4. From a file
    * Pandas can import data from SQL, CSV, TSV, Excel, JSON, etc.

```python
pd.read_csv('data.csv')
```
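The sketch below tries `read_csv` without needing a file on disk. It assumes made-up CSV text and uses `io.StringIO` from the standard library so that pandas can read a string as if it were a file; the variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string like a file
csv_text = "Item,Cost\nBook,10\nPen,2\n"
items_df = pd.read_csv(io.StringIO(csv_text))

# Tab-separated data uses the same function with a different separator
tsv_text = "Item\tCost\nBook\t10\nPen\t2\n"
items_tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(items_df)
print(items_df.equals(items_tsv_df))  # both sources produce the same table
```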

In [119]:
#slide demo
#lists to df
import pandas as pd
lst = ['This', 'class', 'is', 'for', 'you', 'to', 'learn', 'python', '!'] 
lst
df = pd.DataFrame(lst, index =['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], columns =['Names'])
df

['This', 'class', 'is', 'for', 'you', 'to', 'learn', 'python', '!']

Unnamed: 0,Names
a,This
b,class
c,is
d,for
e,you
f,to
g,learn
h,python
i,!


In [123]:
#slide demo
#dictionaries to df
data = {'team':['Leicester', 'Manchester City', 'Arsenal'], 
        'player':['Vardy', 'Aguero', 'Sanchez'], 
        'goals':[24,22,19]
       }
data
my_df = pd.DataFrame(data)
my_df


football = pd.DataFrame(data, columns=['player','team','goals','played'], 
                     index=['one','two','three'])
football

{'team': ['Leicester', 'Manchester City', 'Arsenal'],
 'player': ['Vardy', 'Aguero', 'Sanchez'],
 'goals': [24, 22, 19]}

Unnamed: 0,team,player,goals
0,Leicester,Vardy,24
1,Manchester City,Aguero,22
2,Arsenal,Sanchez,19


Unnamed: 0,player,team,goals,played
one,Vardy,Leicester,24,
two,Aguero,Manchester City,22,
three,Sanchez,Arsenal,19,


In [126]:
#slide demo
#series to df
items = pd.Series(['Book','Pen'])
items
costs = pd.Series([10,2])
costs
df = pd.DataFrame({'Item':items,'Cost':costs})
df

0    Book
1     Pen
dtype: object

0    10
1     2
dtype: int64

Unnamed: 0,Item,Cost
0,Book,10
1,Pen,2


In [152]:
#slide demo
#file to df
bn_csv = pd.read_csv('Popular_Baby_Names.csv')
bn_csv.head(10)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53
5,2011,FEMALE,HISPANIC,GUADALUPE,26,62
6,2011,FEMALE,HISPANIC,HAILEY,126,8
7,2011,FEMALE,HISPANIC,HALEY,14,74
8,2011,FEMALE,HISPANIC,HANNAH,17,71
9,2011,FEMALE,HISPANIC,HAYLEE,17,71


### Accessing Data
1. Lists
    * each element of the list represents a row
    * elements of a python list can be accessed by a numerical index 
```python
list[i]
```

2. Dictionaries
    * elements of a python dictionary can be accessed using keys
```python
dict['key']
```

3. Series
    * elements of a python series can be accessed both like lists and like dictionaries
```python
series['key']
series[i]
```

4. DataFrames
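    * columns of a dataframe can be accessed by name, and rows can be accessed by position with `iloc`
```python
bn_csv['column']
bn_csv.iloc[i]
```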
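The four access patterns above can be compared side by side in a short sketch. The data and variable names here are illustrative assumptions; `iloc` is used for positional access because it works regardless of how the index is labeled.

```python
import pandas as pd

capitals_demo = pd.Series({'France': 'Paris', 'Japan': 'Tokyo'})
capital_by_label = capitals_demo['Japan']       # dictionary-style access by label
capital_by_position = capitals_demo.iloc[1]     # list-style access by position

scores = pd.DataFrame({'name': ['Ann', 'Bo'], 'score': [90, 75]})
score_column = scores['score']    # one column, returned as a Series
first_row = scores.iloc[0]        # one row, returned as a Series indexed by column name

print(capital_by_label, capital_by_position)
print(first_row['name'])
```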

In [205]:
#slide demo
#Accessing Lists
lst[0]

'This'

In [208]:
#slide demo
#Accessing Dictionaries
data = {'team':['Leicester', 'Manchester City', 'Arsenal'], 
        'player':['Vardy', 'Aguero', 'Sanchez'], 
        'goals':[24,22,19]
       }

data['team']

['Leicester', 'Manchester City', 'Arsenal']

In [198]:
#slide demo
#Accessing Series
capitals = pd.Series({'France': 'Paris', 'Japan':'Tokyo', 'Germany': 'Berlin'})
capitals
capitals["France"]
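capitals.iloc[1] #access by position (plain capitals[1] is deprecated on a label-indexed series in recent pandas)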

France      Paris
Japan       Tokyo
Germany    Berlin
dtype: object

'Paris'

'Tokyo'

In [211]:
#slide demo
#Accessing DataFrames
#bn_csv['Child\'s First Name'].head(10)
#bn_csv[['Child\'s First Name', 'Ethnicity']].head(10)
#bn_csv[0:3]
#bn_csv.iloc[0:3]
#bn_csv.iloc[:, [2,3,4]]
#bn_csv.iloc[[2,3], [2,3,4]]
#bn_csv.loc[:, ["Ethnicity", "Child's First Name"]]

### <font color='green'>Using Pandas, you can read in data from many sources</font>

### Reading in CSV files
```python 
pd.read_csv()
```

In [12]:
import pandas as pd
my_df = pd.read_csv('http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv')
display(my_df.head(10))

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
0,1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.116667
1,1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.75
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
5,1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806
6,1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
7,1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028
8,1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.066667,34.766667
9,1/4/09 14:11,Product1,1200,Visa,Aidan,Chatou,Ile-de-France,France,6/3/08 4:22,1/5/09 1:17,48.883333,2.15


### <font color = 'Purple'> PRACTICE QUESTIONS</font>
A) Find a public online dataset and read it into a variable named `my_dataf` as a dataframe.

B) Display the top 7 rows of the dataframe.

C) Select and display the second column of the dataframe.

In [18]:
my_dataf = pd.read_csv('http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv')
display(my_dataf.head(10))
my_dataf.iloc[:, 1]

Unnamed: 0,Transaction_date,Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude
0,1/2/09 6:17,Product1,1200,Mastercard,carolina,Basildon,England,United Kingdom,1/2/09 6:00,1/2/09 6:08,51.5,-1.116667
1,1/2/09 4:53,Product1,1200,Visa,Betina,Parkville,MO,United States,1/2/09 4:42,1/2/09 7:49,39.195,-94.68194
2,1/2/09 13:08,Product1,1200,Mastercard,Federica e Andrea,Astoria,OR,United States,1/1/09 16:21,1/3/09 12:32,46.18806,-123.83
3,1/3/09 14:44,Product1,1200,Visa,Gouya,Echuca,Victoria,Australia,9/25/05 21:13,1/3/09 14:22,-36.133333,144.75
4,1/4/09 12:56,Product2,3600,Visa,Gerd W,Cahaba Heights,AL,United States,11/15/08 15:47,1/4/09 12:45,33.52056,-86.8025
5,1/4/09 13:19,Product1,1200,Visa,LAURENCE,Mickleton,NJ,United States,9/24/08 15:19,1/4/09 13:04,39.79,-75.23806
6,1/4/09 20:11,Product1,1200,Mastercard,Fleur,Peoria,IL,United States,1/3/09 9:38,1/4/09 19:45,40.69361,-89.58889
7,1/2/09 20:09,Product1,1200,Mastercard,adam,Martin,TN,United States,1/2/09 17:43,1/4/09 20:01,36.34333,-88.85028
8,1/4/09 13:17,Product1,1200,Mastercard,Renee Elisabeth,Tel Aviv,Tel Aviv,Israel,1/4/09 13:03,1/4/09 22:10,32.066667,34.766667
9,1/4/09 14:11,Product1,1200,Visa,Aidan,Chatou,Ile-de-France,France,6/3/08 4:22,1/5/09 1:17,48.883333,2.15


0      Product1
1      Product1
2      Product1
3      Product1
4      Product2
         ...   
993    Product1
994    Product2
995    Product3
996    Product1
997    Product1
Name: Product, Length: 998, dtype: object

### Reading in excel files 

* using `read_excel()` function from pandas

```python 
pandas.read_excel()
```
* using other Python libraries (e.g., openpyxl)
    * you would first need to install the library from the command line, using either pip or conda
    ```bash
    pip install openpyxl
    # or
    conda install openpyxl
    ```
    * then import the library and load a workbook
    ```python
    from openpyxl import load_workbook
    workbook = load_workbook(filename="sample.xlsx")
    workbook.sheetnames  # list of the sheet names in the workbook
    ```

### <font color = 'Purple'> Practice Questions</font>
Figure out how to read Excel data into a dataframe named ```my_dataFrame```, then read in the following Excel file.

In [8]:
import pandas as pd
df= pd.read_excel('latitude.xlsx')
display(df.head())


Unnamed: 0,country,1700
0,Afghanistan,34.565
1,Akrotiri and Dhekelia,34.616667
2,Albania,41.312
3,Algeria,36.72
4,American Samoa,-14.307


### Reading in JSON files

```python 
import requests 
import json
URL = 'https:...json'
response = requests.get(URL)
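#parse the JSON body of the response into Python objects (e.g., a dictionary)
dictionary = response.json()
#convert dictionary into a json string
outstring = json.dumps(dictionary)
#convert json string back into a dictionary
dictionary = json.loads(outstring)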
```
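The conversion steps can be tried without a network call. This sketch assumes a hypothetical JSON string standing in for the body of an HTTP response, and turns it into a dataframe:

```python
import json
import pandas as pd

# Hypothetical JSON text, standing in for the body of an HTTP response
json_text = '[{"Item": "Book", "Cost": 10}, {"Item": "Pen", "Cost": 2}]'

records = json.loads(json_text)     # JSON string -> list of dictionaries
json_df = pd.DataFrame(records)     # list of dictionaries -> DataFrame
round_trip = json.dumps(records)    # Python objects -> JSON string

print(json_df)
```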

### Reading in data from APIs

![pic3.png](attachment:pic3.png)

### Reading in data by web scraping
    
Using `beautifulsoup`.
* First you will need to install it from the command line, using either pip or conda:
```bash
pip install beautifulsoup4
# or
conda install beautifulsoup4
```

* Then you need to import the library
```python 
from bs4 import BeautifulSoup
```
* Then you need to read in an HTML document
```python
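# the first argument is the HTML markup (a string or an open file handle);
# the second names the parser, here Python's built-in one
soup = BeautifulSoup("Some HTML file", "html.parser")
print(soup.prettify())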
```

    



### Reading in data from SQL DBs

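Python provides two popular interfaces for working with the SQLite database library: PySQLite (shipped in the standard library as the `sqlite3` module; it follows the DB-API 2.0 interface that drivers for MySQL, PostgreSQL, and Oracle also follow) and APSW (a wrapper written specifically for SQLite).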
* You don't need to install sqlite3 module. It is included in the standard library
* You need to import the library
```python
import sqlite3
```
* Then you need to create the connection
```python
conn = sqlite3.connect('example.db')
```
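As a minimal sketch of the full round trip, the example below builds a throwaway in-memory database instead of opening `example.db`. The table name `tracks_demo` and its rows are made up for illustration. The query result is loaded into a dataframe with `pd.read_sql_query`.

```python
import sqlite3
import pandas as pd

# ':memory:' creates a temporary database that lives only in RAM
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute("CREATE TABLE tracks_demo (AlbumId INTEGER, Name TEXT, Composer TEXT)")
conn_demo.executemany(
    "INSERT INTO tracks_demo VALUES (?, ?, ?)",
    [(1, 'Song A', 'X'), (1, 'Song B', 'Y'), (2, 'Song C', 'X')],
)
conn_demo.commit()

# The ? placeholder passes the value separately instead of pasting it into the SQL text
tracks_by_x = pd.read_sql_query(
    "SELECT AlbumId, Name FROM tracks_demo WHERE Composer = ? ORDER BY Name",
    conn_demo,
    params=('X',),
)
conn_demo.close()
print(tracks_by_x)
```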



### <font color = 'Purple'> Practice Questions</font>

A) Go to this page (https://www.sqlitetutorial.net/sqlite-sample-database/) and download `chinook.db`

B) Using above information on how to use sqlite in python, connect to the db

C) Query tracks table using the database diagram and get the Album ID and the Track Name when the Composer is AC/DC.

In [25]:
import sqlite3
conn = sqlite3.connect('chinook.db')
cursor = conn.cursor()
cursor.execute("SELECT AlbumId, Name FROM tracks WHERE Composer = 'AC/DC';")
print(cursor.fetchall())

[(4, 'Go Down'), (4, 'Dog Eat Dog'), (4, 'Let There Be Rock'), (4, 'Bad Boy Boogie'), (4, 'Problem Child'), (4, 'Overdose'), (4, "Hell Ain't A Bad Place To Be"), (4, 'Whole Lotta Rosie')]


### Reading in images
```python
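import cv2  # OpenCV, installed separately as the opencv-python package
my_img = cv2.imread('my_image.png')  # hypothetical file name; the pixels are returned as a NumPy array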
```

### Selection and Indexing
For data frames:
```python
#numeric indexing: integer-location based indexing
data.iloc[<row selection>, <column selection>]

bn_csv.iloc[0] #first row of a df, returned as a series rather than a df
bn_csv.iloc[-1] #last row of a df as a series
bn_csv.iloc[0:5] #first five rows of a df as a df
bn_csv.iloc[:,0] #first column of a df as a series
bn_csv.iloc[:,-1] #last column of a df as a series
bn_csv.iloc[:, 0:2] #all the rows of first two columns of the df
bn_csv.iloc[[0,3,6],[0,5]] #1st, 4th, 7th rows and 1st, 6th columns
```
Label-based indexing:
```python
#label indexing: unlike iloc, with loc both the start bound and the stop bound are inclusive.
#When using loc, integers can be used, but they refer to the index label and not the position.
data.loc[<row selection>, <column selection>]
bn_csv.loc[:, ["Ethnicity", "Child's First Name"]] #all the rows from these two columns
```

Subsetting data using conditionals 
```python
my_df[my_df.year == 2002]
```
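The difference between positional and label-based slicing is easiest to see when the row labels are integers that differ from the positions. The small dataframe below is an illustrative assumption, not course data:

```python
import pandas as pd

letters = pd.DataFrame(
    {'upper': ['A', 'B', 'C', 'D'], 'code': [65, 66, 67, 68]},
    index=[10, 20, 30, 40],   # integer labels that differ from positions
)

rows_by_position = letters.iloc[0:2]   # positions 0 and 1; the stop bound is excluded
rows_by_label = letters.loc[10:30]     # labels 10 through 30; the stop bound is included
large_codes = letters[letters['code'] > 66]   # a boolean mask keeps the matching rows

print(len(rows_by_position), len(rows_by_label))
print(large_codes)
```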

In [224]:
#slide demo
#iloc
#bn_csv.iloc[0] #first row of a df but it's no longer a df but a series
#bn_csv.iloc[-1] #last row of a df as a series
#bn_csv.iloc[0:5] #first five rows of a df as a df
#bn_csv.iloc[:,0] #first column of a df as a series
#bn_csv.iloc[:,-1] #last column of a df as a series
#bn_csv.iloc[:, 0:2] #all the rows of first two columns of the df
#bn_csv.iloc[[0,3,6],[0,5]] #1st, 4th, 7th rows and 1st, 6th columns

In [225]:
#slide demo
#loc
#bn_csv.loc[:, ["Ethnicity", "Child's First Name"]].head(10)

In [232]:
my_survey_df = pd.read_csv('https://raw.githubusercontent.com/tracykteal/data/master/biology/surveys.csv')
my_survey_df.head()

#slicing and iloc/loc
#my_survey_df['species'] == 'NL'
#my_survey_df.species #same as above but the column name is an attribute here

#my_survey_df[['plot', 'species']].head(10) #select more than one columns
#my_survey_df.loc[:,['plot', 'species']] #the same as above but with loc
#my_survey_df.iloc[:, 4:6] #the same as above but with iloc

#conditional
#my_survey_df[my_survey_df['species'] == 'NL']
#my_survey_df[my_survey_df.species == 'NL'] #the same as above but with the column as an attribute


Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,NL,M,
1,2,7,16,1977,3,NL,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


In [237]:
capitals.iloc[0] #numeric indexing: select by position
capitals.iloc[0][1] #second character of the string 'Paris'
capitals.loc['France'] #associative indexing: select by label

'Paris'

'a'

'Paris'

### <font color = 'Purple'>Practice Questions</font>

1. Read in data from here: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
2. Print the number of rows and columns in the dataframe
3. Print top 10 rows
4. Print the columns
5. Print the top 10 rows of the df's columns that have the word 'acid' in their title
6. Print the 3rd and 5th columns and all their rows

In [49]:
df= pd.read_csv('winequality-red.csv', sep=';')
print(df.shape)
df.iloc[0:10] #top 10 rows (not displayed here: a cell only shows its last expression automatically)
print(df.columns)
acid_cols = [col for col in df.columns if 'acid' in col]
print(df[acid_cols].head(10))
df.iloc[:,[2,4]]

(1599, 12)
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
   fixed acidity  volatile acidity  citric acid
0            7.4              0.70         0.00
1            7.8              0.88         0.00
2            7.8              0.76         0.04
3           11.2              0.28         0.56
4            7.4              0.70         0.00
5            7.4              0.66         0.00
6            7.9              0.60         0.06
7            7.3              0.65         0.00
8            7.8              0.58         0.02
9            7.5              0.50         0.36


Unnamed: 0,citric acid,chlorides
0,0.00,0.076
1,0.00,0.098
2,0.04,0.092
3,0.56,0.075
4,0.00,0.076
...,...,...
1594,0.08,0.090
1595,0.10,0.062
1596,0.13,0.076
1597,0.12,0.075


### Dataframe properties

```python
df.columns #names of columns
df.size # total number of elements (rows x columns)
df.shape # number of rows and columns
df.info() # column descriptions
df.describe() # descriptive statistics
```
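A quick way to see what each property returns is to call it on a tiny dataframe. The data here is a made-up assumption for illustration:

```python
import pandas as pd

props_df = pd.DataFrame({'Item': ['Book', 'Pen', 'Ruler'], 'Cost': [10, 2, 5]})

n_rows, n_cols = props_df.shape      # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(props_df.size)                 # rows * columns
print(list(props_df.columns))        # column names
print(props_df.describe())           # summary statistics for the numeric column
```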

### Modifying Data
* Modifying individual columns is the same as modifying series
```python
cost = df['Cost']
df['Cost'] = cost + 10
```
**Note** that `cost + 10` creates a new series; assigning it back to `df['Cost']` is what modifies the dataframe

* Rows can be modified using slicing/indexing sections of the data frame
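The following sketch, using a hypothetical `shop` dataframe, illustrates both kinds of modification. Arithmetic on a column produces a new series, so the result is assigned back to change the dataframe; a single cell is changed through `loc`.

```python
import pandas as pd

shop = pd.DataFrame({'Item': ['Book', 'Pen'], 'Cost': [10, 2]})

raised = shop['Cost'] + 10          # arithmetic builds a new Series
print(shop['Cost'].tolist())        # the DataFrame still holds the original values

shop['Cost'] = raised               # assigning back updates the column
shop.loc[0, 'Item'] = 'Notebook'    # change one cell by row label and column name
print(shop)
```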

In [245]:
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df

#df.iloc[[0],[0]] = 'Nick Jonas'
#df
#df.iloc[[0],[0]] == 'Nick Jonas'
#df
#df.iloc[0, df.columns.get_loc('Age')] = 100
#df

#num_test = df['Age']
#num_test = num_test + 10
#num_test

Unnamed: 0,Name,Age
0,tom,10
1,nick,15
2,juli,14


0    20
1    25
2    24
Name: Age, dtype: int64

### Querying data through Boolean

1. First we create a boolean series where each entry indicates if the species is our value of interest.
2. We can slice the dataframe with the boolean series
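Boolean series can also be combined before slicing. This sketch assumes a three-row stand-in for the iris data and uses `&` (and), `|` (or), and `~` (not):

```python
import pandas as pd

flowers = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica'],
    'sepal_width': [3.5, 4.2, 3.0],
})

is_setosa_demo = flowers['species'] == 'setosa'   # one True/False entry per row
is_wide = flowers['sepal_width'] > 4

both = flowers.loc[is_setosa_demo & is_wide]      # rows where both conditions hold
either = flowers.loc[is_setosa_demo | is_wide]    # rows where at least one holds
not_setosa = flowers.loc[~is_setosa_demo]         # rows where the condition is False

print(both)
```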

In [6]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df.head(10)
print(df['species'].unique())


is_setosa = df['species']=='setosa'
is_setosa
df.loc[is_setosa]

#has_large_width = df['sepal_width'] > 4
#has_large_width
#df.loc[is_setosa & has_large_width]


['setosa' 'versicolor' 'virginica']


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


### Descriptive Statistics
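* Both series and dataframes support global aggregation functions
* Non-numeric columns may be omitted from the global aggregate summary (recent pandas versions expect `numeric_only=True` for this when calling functions such as `mean()` on a dataframe with text columns)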
    * min()
    * max()
    * std()
    * sum()
    * mean()
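A short sketch of these aggregations on a series and on a dataframe follows. The data is made up, and `numeric_only=True` is passed explicitly so that the text column is skipped:

```python
import pandas as pd

stats_df = pd.DataFrame({
    'species': ['a', 'a', 'b'],
    'length': [1.0, 2.0, 6.0],
})

length_mean = stats_df['length'].mean()          # aggregation on a single Series
length_max = stats_df['length'].max()
numeric_means = stats_df.mean(numeric_only=True) # one value per numeric column

print(length_mean, length_max)
print(numeric_means)
```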

In [255]:
#df.min()
#df['sepal_length'].min()

### Split-apply-combine with `groupby`

* Split-apply-combine is a common analysis task. It can be done with `groupby()`, which allows you to generate subtotals by one or more columns.
```python
df.groupby('species').mean()
```
* Here we:
    * Split the dataframe by the value of the 'species' column
    * Computed the mean for each split
    * Combined the means into a results dataframe
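To make the three steps concrete, the sketch below computes group means twice on a made-up dataframe: once with `groupby()` and once with an explicit loop that splits, applies, and combines by hand. The loop is only for illustration; `groupby()` is the idiomatic route.

```python
import pandas as pd

measurements = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'length': [1.0, 3.0, 10.0, 20.0],
})

# groupby performs split-apply-combine in one call
group_means = measurements.groupby('species')['length'].mean()

# The same steps written out by hand
manual = {}
for key in measurements['species'].unique():
    subset = measurements[measurements['species'] == key]   # split
    manual[key] = subset['length'].mean()                    # apply
manual_means = pd.Series(manual)                             # combine

print(group_means)
```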

In [63]:
#import numpy as np
df.head(10)
df.groupby('species').agg(["median", "mean"])

Unnamed: 0_level_0,sepal_length,sepal_length,sepal_width,sepal_width,petal_length,petal_length,petal_width,petal_width
Unnamed: 0_level_1,median,mean,median,mean,median,mean,median,mean
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
setosa,5.0,5.006,3.4,3.428,1.5,1.462,0.2,0.246
versicolor,5.9,5.936,2.8,2.77,4.35,4.26,1.3,1.326
virginica,6.5,6.588,3.0,2.974,5.55,5.552,2.0,2.026


### <font color = 'Purple'> Practice Questions</font>

The dataframe `london_df` loads the `Number of International Visitors to London Dataset` from https://data.london.gov.uk/dataset/number-international-visitors-london
 1. Print the number of rows and columns of the dataframe 
 2. Print the number of rows of the dataframe and format the print to put the thousands comma separator--as in `23,400` instead of `23400`
 3. Print the names of all the columns of the dataframe
 4. Print the 4th row of the dataframe
 5. What is the maximum number of nights spent by market for the 3rd quarter (Q3) of 2017? (**hint**: use groupby)
 6. Which market was the source of the minimal number of visits in the whole dataset? (**hint**: use loc)
 7. What was the most popular duration of stay (`dur_stay`) in 2016? (**hint**: think about a count method or a function)
 8. Print the market, from which the most visits/visitors came for Business purpose in 2009.
 
**Note**: please use comments to number each of your answers.


In [2]:
import pandas as pd
london_df= pd.read_csv('international-visitors-london-raw.csv')
#Problem-1
london_df.shape
#Problem-2
count_row = london_df.shape[0]
print('Number of rows are:', format(count_row, ",d"))
#Problem-3
print('The columns are:', london_df.columns)
#Problem-4
london_df.iloc[3]
#Problem-5
df = london_df.groupby(['year', 'quarter','market']).agg({'nights': ['max']})
df.columns=['Maximum nights']
df = df.reset_index()
is_year = df['year']== 2017
is_quarter= df['quarter']=='Q3'
df.loc[is_year & is_quarter]
#problem-6
london_df.loc[london_df.visits == london_df.visits.min()]
#problem-7
df1= london_df.loc[london_df.year==2016]
df1= df1.groupby('dur_stay')['dur_stay'].count()
#problem-8
is_y = london_df['year']== 2016
is_b= london_df['purpose']=='Business'
df2= london_df.loc[is_y & is_b]
df3 = df2.groupby(['market', 'purpose']).agg({'visits': ['max']})
df3.columns=['max_visits']
df3 = df3.reset_index()
df3.loc[df3.max_visits == df3.max_visits.max()]


Number of rows are: 58,976
The columns are: Index(['year', 'quarter', 'market', 'dur_stay', 'mode', 'purpose', 'area',
       'visits', 'spend', 'nights', 'sample'],
      dtype='object')


Unnamed: 0,market,purpose,max_visits
60,USA,Business,62.775518


In [5]:
df = london_df.groupby(['year', 'quarter','market']).agg({'nights': ['max']})
df.columns=['Maximum nights']
df = df.reset_index()
df.head()
is_quarter= df['quarter']=='Q3'
df.loc[is_quarter]
df.head()


Unnamed: 0,year,quarter,market,Maximum nights
0,2002,Q1,Argentina,38.8587
1,2002,Q1,Australia,148.406548
2,2002,Q1,Austria,23.3499
3,2002,Q1,Belgium,49.196098
4,2002,Q1,Brazil,77.216499


## 2. NumPy - Numeric Python 

NumPy is the fundamental package for `scientific computing` with Python. The `ndarray` (n-dimensional array) is the primary data structure in `NumPy`. It is a `table of elements` (usually numbers), all of the same type, `indexed` by a tuple of non-negative integers. A NumPy array is an alternative to a `python list` with wider abilities: calculations can be performed over entire arrays at once. In NumPy, `dimensions` are called `axes`.

To use numpy in your code, use:
```python
#run this in command line (terminal)
pip install numpy
#or
conda install numpy

#then add this to your notebook
import numpy as np
```

In [1]:
#why is NumPy useful? Why can't we just stick to lists?
length = [120, 110, 134]
width = [45, 63, 78]
length * width

TypeError: can't multiply sequence by non-int of type 'list'

In [2]:
#this is how numpy is more useful than lists
import numpy as np
#now convert these into NumPy arrays
length_np = np.array(length)
width_np = np.array(width)
#you can see the calculations were done element wise
length_np * width_np

array([ 5400,  6930, 10452])

### NumPy Array Type

NumPy assumes that all elements of an array are of a single type: an array of floats, booleans, strings, etc. A NumPy array is a new kind of Python type and, hence, has its own methods. If you create an array from data of different types, some of the elements' types will be changed to end up with a homogeneous array (see below). This is known as `type coercion`.

```python
my_list_a = [100.0, True]
my_list_b = [1, 3, "hello"]
my_list_a + my_list_b

#compare the output of the above code with the one below

my_array_a = np.array([100, 13, 7])
my_array_b = np.array(['bob', True, 103])
```
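Type coercion can be observed directly through the `dtype` attribute. This is a minimal sketch with illustrative variable names; the exact integer width reported can differ between platforms, so only the kind of type is checked.

```python
import numpy as np

ints_and_floats = np.array([1, 2.5])          # the integer is promoted to a float
bools_and_ints = np.array([True, 2])          # the boolean is promoted to an integer
with_a_string = np.array(['bob', True, 103])  # everything is converted to a string

print(ints_and_floats.dtype)       # a float type
print(bools_and_ints.dtype.kind)   # 'i' marks a signed integer type
print(with_a_string.dtype.kind)    # 'U' marks a unicode string type
print(with_a_string)
```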

In [3]:
#slide demo
my_list_a = [100.0, True]
my_list_b = [1, 3, "hello"]
my_list_a + my_list_b

[100.0, True, 1, 3, 'hello']

In [5]:
#slide demo
my_array_a = np.array([100, 13, 7])
my_array_b = np.array(['bob', True, 103])
#my_array_a + my_array_b
print(my_array_b)

# every element is coerced to a string (a 21-character unicode string dtype)

['bob' 'True' '103']


In [6]:
#True is converted to 1, False is converted to 0.
#Knowing this, what do you think below code's output is?
np.array([True, 1, 2, True]) + np.array([3, False, 7, True])

array([4, 1, 9, 2])

### NumPy <font color='green'> Array</font>
NumPy’s array class is called `ndarray`. It is also known by the alias `array`. Note that `numpy.array` is not the same as the Standard Python Library class `array.array`, which only handles one-dimensional `arrays` and offers less functionality.

ndarray:
   * supports multiple numeric types (e.g., float, int, complex)
   * numpy arrays have attributes (ndim, shape, size, dtype), unlike standard python lists
   * supports operators (vectorization)
   * allows indexing, slicing, reshaping

### <font color='green'>`Arrays` & Dimensionality</font>

`array` transforms sequences of sequences into `two-dimensional arrays`, sequences of sequences of sequences into `three-dimensional arrays`, and so on.

```python
exp_arr = np.array([(1.5,2,3), (4,5,6)])
```

The type of the array can also be explicitly specified at creation time:

```python
exp_arr = np.array( [ [1,2], [3,4] ], dtype=complex )

```

![pic1.png](attachment:pic1.png)


### 1-3D Arrays

#### 1D array
```python
np.arange(4)
```
#### 2D array
```python
#think of these as list of lists
np.arange(6).reshape(2,3)
```
#### 3D array
```python
np.arange(24).reshape(4,3,2)
```
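The shapes produced by the three calls above can be inspected as follows. This is a small sketch with illustrative variable names:

```python
import numpy as np

one_d = np.arange(4)                       # 4 elements along a single axis
two_d = np.arange(6).reshape(2, 3)         # 2 rows, 3 columns
three_d = np.arange(24).reshape(4, 3, 2)   # 4 blocks, each with 3 rows and 2 columns

print(one_d.ndim, two_d.ndim, three_d.ndim)
print(one_d.shape, two_d.shape, three_d.shape)
print(three_d[0])   # the first block is itself a 2D array of shape (3, 2)
```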

In [20]:
#slide demo
#np.arange(4)
#np.arange(6).reshape(2,3)
#np.arange(24).reshape(4,3,2)

### <font color='green'>Creating a `numpy array`:</font>
* From data in a text or binary file
```python
numpy.fromfile()
```
* From scratch. `np.array` is just a convenience `function` to create an `ndarray`; it is not a class itself.
```python
my_array = np.array([1,2,3]) #simple array
my_array = np.array(range(10)) #without specifying the elements
my_array = np.ndarray(range(10)) #not the same as above and not recommended: the argument is interpreted as a shape, not as data
```
* Using `arange()` to create an array from a sequence of numbers
```python
my_array = np.arange(10) #the same as above but using numpy method, it creates an array of numbers from 0-9
my_array = np.arange(15).reshape(3, 5) #creates an array from 0-14 and arrange it as 3 rows and 5 columns
```
* Using the functions `zeros()`, `ones()`, and `empty()` to create an array with placeholder content

```python
my_array = np.zeros((3,4)) #creates an array of 4 columns by 3 rows of 0s
my_array = np.ones((2,3)) #creates an array of 3 columns by 2 rows of 1s
my_array = np.empty((2,3)) #creates an array of 3 columns by 2 rows of uninitialized (arbitrary) values
```

* Using `random()` to create an array with random numbers: `np.random` is a module with many functions for generating random numbers (e.g., rand, randint, poisson, uniform, etc.)
```python
my_array = np.random.random((2,4)) #creates a 2d array with random numbers
```
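As a hedged alternative sketch, NumPy also provides a generator interface, `np.random.default_rng`, which takes a seed so that results can be reproduced. The seed value and variable names below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # seeded generator for repeatable draws

uniform_demo = rng.random((2, 4))                      # uniform floats in [0, 1)
heights_demo = rng.normal(loc=60, scale=3, size=1000)  # normal: mean 60, sd 3
dice_demo = rng.integers(1, 7, size=10)                # integers 1 to 6 (7 is excluded)

print(uniform_demo.shape, heights_demo.shape)
print(dice_demo)
```

Creating a second generator with the same seed and drawing in the same order yields the same numbers.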

In [7]:
#slide demo
#From scratch
my_array = np.array([1,2,3]) #simple array
my_array
my_array = np.array(range(10)) #without specifying the elements
my_array
my_array = np.ndarray(range(10)) #not the same as above and not recommended: the argument is interpreted as a shape (see the output)
my_array

array([], shape=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), dtype=float64)

In [36]:
#slide demo
#Using arange() to create an array from a sequence of numbers
my_array = np.arange(10) #the same as above but using numpy method, it creates an array of numbers from 0-9
my_array
my_array = np.arange(15).reshape(3, 5) #creates an array from 0-14 and arrange it as 3 rows and 5 columns
my_array

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [51]:
#slide demo
#Using functions zeros(), ones(), empty() to create an array with placeholder content
my_array = np.zeros((3,4)) #creates an array of 4 columns by 3 rows of 0s
my_array = np.ones((2,3)) #creates an array of 3 columns by 2 rows of 1s
my_array = np.empty((3,3)) #creates an array of 3 columns by 3 rows of arbitrary (uninitialized) values as initial content

In [8]:
#Using random()
my_array = np.random.random((2,4)) #creates a 2d array with random numbers

#create a 2D array from simulated random numbers
#for height, 60 is the distribution mean, 3 is the distribution sd, and 1000 is the number of samples drawn from this distribution
height = np.round(np.random.normal(60, 3, 1000), 2)
weight = np.round(np.random.normal(130, 10, 1000), 2)
h_w_np = np.column_stack((height, weight))
h_w_np

array([[ 62.29, 121.75],
       [ 57.25, 140.45],
       [ 62.49, 135.63],
       ...,
       [ 57.18, 142.77],
       [ 63.58, 117.57],
       [ 66.8 , 130.11]])

### <font color = 'Purple'> Practice Questions</font>

A) Create a numpy array with two different approaches

B) The numpy array should have an integer data type

C) The numpy array should be 2 dimensional 

D) The numpy array should have 3 columns and 6 rows

In [16]:
#Problem-1
my_array1= np.array(['Roshni',25.0,7,True])
print(my_array1)
my_array2=np.arange(10)
print(my_array2)
#problem-2
my_array3= np.array([5,6,7,8])
print(my_array3)
#Problem-3
my_array4=np.arange(12).reshape(4,3)
print(my_array4)
#problem-4
my_array5=np.arange(18).reshape(6,3) #6 rows and 3 columns
print(my_array5)

['Roshni' '25.0' '7' 'True']
[0 1 2 3 4 5 6 7 8 9]
[5 6 7 8]
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]]


### <font color='green'> `ndarray` attributes</font>

`ndim`: the number of axes/dimensions of the array

`dtype`: an object describing the type of the elements in the array.

`size`: the total number of elements

`shape`: a tuple giving the number of elements along each axis

Remember these are similar to methods but aren't methods since they don't have `()` after them. These are attributes.

In [18]:
a = np.array(range(10))
c = np.array([[1,2], [4,5]])
d = np.array([[1,2,3], [4,5,6]]) #note: this is an ndarray object
print(a,c,d)
#print the ndim
print(a.ndim, c.ndim, d.ndim)
#print the dtype
print(a.dtype, c.dtype, d.dtype)
#print the size
print(a.size, c.size, d.size)
#print the shape
print(a.shape, c.shape, d.shape)

[0 1 2 3 4 5 6 7 8 9] [[1 2]
 [4 5]] [[1 2 3]
 [4 5 6]]
1 2 2
int64 int64 int64
10 4 6
(10,) (2, 2) (2, 3)


### <font color='green'>Operations with `numpy arrays`</font>

Arithmetic operators on arrays apply elementwise.

```python
x = np.array([1,2,3])
y = np.array([5,6,7])

x+y
x*y
x-y
x**y
```
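Elementwise arithmetic also works between arrays of different but compatible shapes, which NumPy calls broadcasting. The sketch below uses illustrative variable names and contrasts elementwise multiplication with the dot product:

```python
import numpy as np

x_demo = np.array([1, 2, 3])
y_demo = np.array([5, 6, 7])

elementwise = x_demo * y_demo     # 1*5, 2*6, 3*7
dot_product = x_demo @ y_demo     # 1*5 + 2*6 + 3*7
scaled = x_demo * 2               # the scalar is applied to every element

# A row vector is added to each row of a 2D array
table = np.arange(6).reshape(2, 3)
shifted = table + np.array([10, 20, 30])

print(elementwise, dot_product, scaled)
print(shifted)
```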

In [21]:
#1 dimensional
x = np.array([1,2,3])
y = np.array([5,6,7])
print(x)
print(y)
#x*2 #element wise multiplication
#z=x + np.array([10]) #element wise addition
#print(z)
#x*y #element by element multiplication: 1*5, 2*6, 3*7
#print(x.dot(y)) #dot product:(x1*y1 +... + xn*yn): 1*5+2*6+3*7
print(x @ y) #same as above: adds up all the element values after multiplying them

[1 2 3]
[5 6 7]
38


In [25]:
#2 dimensional
z = np.arange(6,10).reshape(2, 2)
w = np.arange(4).reshape(2, 2)
print(z)
print(w)
#print(z*2) #element wise multiplication
#z + z #same as above: adds each element to itself
#print(z + np.array([10, 10])) #element wise addition
#z*w #element by element multiplication: 6*0, 7*1, 8*2, 9*3
z @ w #matrix product:(6*0 + 7*2, 6*1+7*3, 8*0+9*2, 8*1+9*3): 
#z.dot(w) ##same as above

[[6 7]
 [8 9]]
[[0 1]
 [2 3]]


array([[14, 27],
       [18, 35]])

### <font color='green'>Descriptive Statistics</font>


```python
x = np.array([1,2,3, 10, 565, 34, 67])
my_stats = x.sum()
my_stats = x.mean()
my_stats = x.min()
my_stats = x.max()
```
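On a 2D array, the same statistics can be computed per column or per row through the `axis` argument. This sketch assumes a 10 by 5 array of the numbers 50 to 99 and also uses `np.median` on a single row:

```python
import numpy as np

grid = np.arange(50, 100).reshape(10, 5)

overall_mean = grid.mean()           # one number for the whole array
column_means = grid.mean(axis=0)     # collapse the rows: one mean per column
row_means = grid.mean(axis=1)        # collapse the columns: one mean per row
second_row_median = np.median(grid[1])   # the row at index 1 holds 55, 56, 57, 58, 59

print(column_means)
print(row_means[0], second_row_median)
```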

In [29]:
x = np.array([1,2,3, 10, 565, 34, 67])
x.sum()
print(np.sum(x[0:3]))

y = np.arange(50,100).reshape(10, 5)
y.mean()
print(np.corrcoef(y[:,0], y[:,2]))
np.std(y[:,2])

#how can i get the median of second row?



6
[[1. 1.]
 [1. 1.]]


14.361406616345072

In [64]:
my_array = np.arange(16).reshape(4, 4)
my_array
#what would this grab?
#round(my_array[1][2], 3)

#what about this?
#round(my_array[1,2], 3)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

### <font color='green'>Indexing, Slicing, Iterating</font>

`One-dimensional arrays` can be indexed, sliced and iterated over, much like `lists`.

```python
arr = np.array(range(10)).reshape(5,2)
arr
```

* Print the first row of the 2D array (as a 2D array with one row)
```python
print(arr[0:1])
```

* Print the elements of the first row from column 1 onward (slices past the last column are clipped)
``` python
print(arr[0:1, 1:3])
```

* Print the last element of each row (i.e., the last column)
``` python
print(arr[:, -1])
```

* Print the last row of the array
``` python
print(arr[:][-1])
```

* The same as above
``` python
print(arr[-1])
```

* Find the elements in the array that are > 3
``` python
greater_than_3 = arr>3
print(arr[arr>3])
```

* The same as above
``` python
print(arr[greater_than_3])
```

* Iterating over arrays
``` python
my_arr = np.array(range(10))
for row in my_arr:
    print(row)
```

* updating elements while slicing
``` python
a = np.arange(10)**3
a[:6:2] = -10
a
```

* reversing the order of the array
``` python
a[::-1]
```
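The results of the main indexing expressions on a 5 by 2 array are collected in the sketch below, so they can be checked against expectations. The variable names are illustrative.

```python
import numpy as np

grid_5x2 = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], [6 7], [8 9]

first_row = grid_5x2[0]            # a single row as a 1D array
last_column = grid_5x2[:, -1]      # the last element of every row
last_row = grid_5x2[-1]            # the final row
one_cell = grid_5x2[1, 0]          # row 1, column 0
above_three = grid_5x2[grid_5x2 > 3]   # a boolean mask returns a flat 1D array

cubes = np.arange(10) ** 3
cubes[:6:2] = -10                  # overwrite positions 0, 2, and 4

print(first_row, last_column, last_row, one_cell)
print(above_three)
print(cubes[::-1])                 # reversed copy of the order
```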

In [2]:
arr = np.array(range(10)).reshape(5,2)
#arr

#Print the first row of the 2D array
#print(arr[0:1])

#Print the elements of the first row from column 1 onward
#print(arr[0:1, 1:3]) 
#or 
print(arr[0:1, 1])

#Print the last element of each row (the last column)
#print(arr[:, -1]) 
#or 
#print(arr[:, 1])

#Print the last row of the array
#print(arr[:][-1]) 
#or 
#print(arr[-1])

#Find the elements in the array that are > 3
#print(arr[arr>3]) 
#or 
#greater_than_3 = arr>3 
#print(arr[greater_than_3])

#Iterating over arrays
#my_arr = np.array(range(10))
#for row in my_arr:
#    print(row)

#Update elements while slicing
#a = np.arange(10)**3
#a
#a[:6:2] = -10
#a

#Reversing the order of the array
#a[::-1]

### <font color = 'Purple'> Practice Questions</font>

A) Create a 1D `house_dim_np` numpy array from `house_dim` list 

B) Convert the numbers, which are in centimeters, into inches and store them in a new array called `house_dim_np_in`

C) Create a 2D `player_weight_np` numpy array from `player_weight` list  

D) Print out the 4th row of `player_weight`

E) Print out the entire second column of `player_weight`

F) Print out the weight of player #5 from `player_weight`

In [38]:
house_dim = [180, 215, 210, 210, 188, 176, 209, 200]
#Problem(A)
house_dim_np=np.array(house_dim)
print(house_dim_np)
#Problem(B)
house_dim_np_in=house_dim_np*0.393701  # 1 cm = 0.393701 in
print(house_dim_np_in)

player_weight = [[1, 167],
            [2, 180],
            [3, 250],
            [4, 223],
            [5, 187], 
            [6, 169],
            [7, 210]]
#Problem(C)
player_weight_np=np.array(player_weight).reshape(7,2)
print(player_weight_np)
#Problem(D)
print(player_weight_np[3,:])
#Problem(E)
print(player_weight_np[:,1])
#Problem(F)
player5_weight=player_weight_np[4,1]
print('The weight of player 5 is:', player5_weight)

[180 215 210 210 188 176 209 200]
[70.86618  84.645715 82.67721  82.67721  74.015788 69.291376 82.283509
 78.7402  ]
[[  1 167]
 [  2 180]
 [  3 250]
 [  4 223]
 [  5 187]
 [  6 169]
 [  7 210]]
[  4 223]
[167 180 250 223 187 169 210]
The weight of player 5 is: 187


### <font color='green'>More Methods</font>

In [315]:
a = np.random.random((3,4))
a
b = np.floor(10*a)
b
print(a)
#flatten the array
a.ravel()
#transpose the array
a.T

array([[0.2575292 , 0.80055217, 0.41068078, 0.94132232],
       [0.44969683, 0.36206241, 0.80915254, 0.22902834],
       [0.26479056, 0.95755342, 0.02686927, 0.47646961]])

array([[2., 8., 4., 9.],
       [4., 3., 8., 2.],
       [2., 9., 0., 4.]])

[[0.2575292  0.80055217 0.41068078 0.94132232]
 [0.44969683 0.36206241 0.80915254 0.22902834]
 [0.26479056 0.95755342 0.02686927 0.47646961]]


array([0.2575292 , 0.80055217, 0.41068078, 0.94132232, 0.44969683,
       0.36206241, 0.80915254, 0.22902834, 0.26479056, 0.95755342,
       0.02686927, 0.47646961])

array([[0.2575292 , 0.44969683, 0.26479056],
       [0.80055217, 0.36206241, 0.95755342],
       [0.41068078, 0.80915254, 0.02686927],
       [0.94132232, 0.22902834, 0.47646961]])
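
Because the random values above change on every run, a deterministic array makes the effect of `ravel()` and `.T` easier to follow. A minimal sketch; the array `m` and the names `flat` and `transposed` are illustrative and not part of the cell above:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

flat = m.ravel()      # 1D version, read row by row
transposed = m.T      # rows and columns swapped: shape (4, 3)

print(flat)
print(transposed.shape)
print(transposed[1, 2] == m[2, 1])
```

The last line illustrates the rule behind transposition: the element at row `i`, column `j` of `m.T` is the element at row `j`, column `i` of `m`.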

### <font color = 'Purple'> Practice Questions</font>

A) Create a 2 dimensional array

B) Print the elements 5, 6, and 7

C) Update two of the elements to values different from their initial ones

In [8]:
import numpy as np
my_array = np.arange(16).reshape(4, 4)
print(my_array)

print('The 5th element of the array is', my_array[1,0])
print('The 6th element of the array is', my_array[1,1])
print('The 7th element of the array is', my_array[1,2])

my_array[2,1]=40
my_array[3,3]= 61
print(my_array)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
The 5th element of the array is 4
The 6th element of the array is 5
The 7th element of the array is 6
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8 40 10 11]
 [12 13 14 61]]


### <font color = 'Purple'> Practice Questions</font>

1. Determine the shape, number of dimensions and type of elements in the numpy array `arr`
2. Determine the sum of all elements in `arr`
3. Determine the mean of all elements in `arr`
4. Create an array of the same shape as `arr` but filled with zeros
5. Create an array of the same shape as `arr` but filled with ones
6. Create an array of the same shape as `arr` but where all elements are the square root values
7. Create a new array called `arr_new`, which should be of shape 8x8 and be the result of a matrix multiplication of `arr` with the transpose of `arr`.

In [324]:
# Import necessary libraries and data (this is something you have already worked with above)
arr = np.arange(48).reshape(8, 6)
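
One possible way to approach the seven questions above, offered as a sketch and not as the only solution. The variable names other than `arr` and `arr_new` are illustrative, and the printed element type depends on the platform:

```python
import numpy as np

arr = np.arange(48).reshape(8, 6)

print(arr.shape, arr.ndim, arr.dtype)  # 1. shape, dimensions, element type
total = arr.sum()                      # 2. sum of all elements
average = arr.mean()                   # 3. mean of all elements
zeros = np.zeros_like(arr)             # 4. same shape, filled with zeros
ones = np.ones_like(arr)               # 5. same shape, filled with ones
roots = np.sqrt(arr)                   # 6. element-wise square roots
arr_new = arr @ arr.T                  # 7. (8, 6) @ (6, 8) gives (8, 8)

print(total, average, arr_new.shape)
```

`np.zeros((8, 6))` and `np.ones((8, 6))` would also work for questions 4 and 5; `zeros_like` and `ones_like` are used here because they copy the shape from `arr` directly.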

### End of Notebook

In [54]:
#display the result of every expression in a cell, not just the last one, without calling print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
#add scrolling to the slides
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1024,
        'height': 768,
        'scroll': True,
})