# Pandas
## Series/vectors 
* a Series is a 1D labeled array that can hold any data type; values can be accessed by label or by zero-based integer position
* data within a Series does not have to be homogeneous; mixed types are stored with the generic object dtype
* can be created from a list, either with the default integer index or with custom labels
* can be created from a dictionary (the keys become the index) or from a single value repeated across an index

In [47]:
import pandas as pd

# converting from list with standard indexing
# named values rather than list, so the built-in list type is not shadowed
values = [2, 3, 4, 5, 6]
series1 = pd.Series(values)
print("from standard list:\n", series1)

# converting from list with custom indexing
index = ['a','b','c','d','e']
series2_1 = pd.Series(values, index=index)
series2_2 = pd.Series(values, index=['f','g','h','i','j'])
print("custom indexing: from lists\n", series2_1)

# from dictionary
# named days rather than dict, so the built-in dict type is not shadowed
days = {'sun':1, 'mon':2, 'tues':3, 'wed':4}
series3 = pd.Series(days)
print("\nfrom a dictionary:\n", series3)

# repeating values
series4 = pd.Series(3, index=['x','y','z'])
print("\nrepeated values:\n", series4)

from standard list:
 0    2
1    3
2    4
3    5
4    6
dtype: int64
custom indexing: from lists
 a    2
b    3
c    4
d    5
e    6
dtype: int64

from a dictionary:
 sun     1
mon     2
tues    3
wed     4
dtype: int64

repeated values:
 x    3
y    3
z    3
dtype: int64
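
The cell above covers construction. As a minimal sketch of the other two points in the notes (access by label or by position, and mixed data types), the example below uses hypothetical values that are not part of the notebook:

```python
import pandas as pd

# hypothetical example data, not taken from the cells above
temps = pd.Series([21.5, 19.0, 23.2], index=['mon', 'tues', 'wed'])

by_label = temps['tues']       # label-based lookup
by_position = temps.iloc[1]    # zero-based positional lookup

# mixing types is allowed, but the series falls back to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])

print(by_label, by_position, mixed.dtype)
```

Using .iloc for positional access keeps the intent explicit, which matters once a series has an integer index of its own.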


## DataFrames
* DataFrames are 2D tables, similar to spreadsheets
* each column holds a single data type, but different columns can have different types
* DataFrames are size-mutable: rows and columns can be added or removed after creation
* they can be created from a dictionary, an Excel spreadsheet, a list of lists, or a NumPy array

In [48]:
# from dictionary
dict2 = {
    "UID": [11223300, 22334400, 33445500],
    "GPA": [3.1, 3.9, 2.3],
    "Year": [1, 4, 2]
}
dictFrame = pd.DataFrame(dict2)
print("from dictionary:\n", dictFrame)

import numpy as np
# from numpy array
nparray = np.array([[1, 2.4, 1/5],[2, 4.3, 2/6]])
npFrame = pd.DataFrame(nparray, columns=['int','float','frac'])
print("\nfrom numpy array\n", npFrame)

from dictionary:
         UID  GPA  Year
0  11223300  3.1     1
1  22334400  3.9     4
2  33445500  2.3     2

from numpy array
    int  float      frac
0  1.0    2.4  0.200000
1  2.0    4.3  0.333333
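
The notes also mention building a frame from a list of lists, which the cell above does not show. Here is a small sketch that reuses the UID/GPA/Year values from the dictionary example; the name listFrame is only illustrative:

```python
import pandas as pd

# each inner list is one row: [UID, GPA, Year]
rows = [[11223300, 3.1, 1], [22334400, 3.9, 4]]
listFrame = pd.DataFrame(rows, columns=['UID', 'GPA', 'Year'])

# every column keeps its own data type
print(listFrame.dtypes)
```

This differs from the NumPy example: a NumPy array has a single data type, so the integers there were converted to floats, which is why the int column prints as 1.0 and 2.0.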


### reading and writing files
* reading a CSV file: pd.read_csv(filename)
* writing to CSV: frameName.to_csv(filename); the index is written as the first column unless index=False is passed
* reading an Excel file: pd.read_excel(filename)
* writing to Excel: frameName.to_excel(filename)

In [49]:
# reading from a csv file:
flowerset = pd.read_csv('flower_dataset.csv')
flowerset.head()

           species    size fragrance  height_cm
0             rose  medium      mild      48.55
1  shoeblack plant  medium      mild     147.07
2  shoeblack plant  medium      none     102.93
3         hibiscus   large      none     184.00
4  shoeblack plant   large      mild      83.07
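
A round trip through CSV can be sketched without touching the disk. The frame below is hypothetical, and io.StringIO stands in for a real file; with a real file the same calls take a filename instead.

```python
import io
import pandas as pd

# hypothetical two-row frame, not the flower dataset itself
frame = pd.DataFrame({'species': ['rose', 'hibiscus'],
                      'height_cm': [48.55, 184.0]})

# to_csv() with no path returns the CSV text; index=False leaves out the row labels
csv_text = frame.to_csv(index=False)

# read_csv accepts a path or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))

# with the default index=True the row labels are written as an unnamed first column;
# index_col=0 turns that column back into the index when reading
with_index = pd.read_csv(io.StringIO(frame.to_csv()), index_col=0)

print(restored)
print(with_index)
```

If a file was written with the index and then read without index_col=0, the old row labels come back as an extra column named "Unnamed: 0".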


### querying 
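* boolean indexing: frame[frame[label] > value], using any comparison operator (==, !=, <, <=, >, >=)
* .query('label1 > value1 and label2 == value2'): filters with a condition written as a string
* .loc[]: frame.loc[condition] selects the rows where a boolean condition is True
* isin(): keeps rows whose value appears in a list of values. frame[frame[label].isin(values)]
* between(): filters for values between two given values, inclusive by default. frame[frame[label].between(low, high)]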

In [50]:
filter1 = flowerset[flowerset['size'] == 'medium']
print(filter1)
print()
filter2 = flowerset.query('size == "medium" and fragrance == "mild" and height_cm < 31')
print(filter2)

              species    size fragrance  height_cm
0                rose  medium      mild      48.55
1     shoeblack plant  medium      mild     147.07
2     shoeblack plant  medium      none     102.93
15    shoeblack plant  medium      none     101.99
16               rose  medium    strong      33.10
...               ...     ...       ...        ...
9983             rose  medium      mild      64.34
9986             rose  medium    strong      48.27
9989             rose  medium    strong      50.14
9996  shoeblack plant  medium      mild     145.23
9999             rose  medium      mild      88.11

[3337 rows x 4 columns]

     species    size fragrance  height_cm
2993    rose  medium      mild      30.85
3784    rose  medium      mild      30.70
4716    rose  medium      mild      30.53
5875    rose  medium      mild      30.54
6925    rose  medium      mild      30.23
7070    rose  medium      mild      30.85
7130    rose  medium      mild      30.27
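8221    rose  medium      mild      30.94
8448    rose  medium      mild      30.36
9587    rose  medium      mild      30.89
9625    rose  medium      mild      30.55
9831    rose  medium      mild      30.55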
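
The cell above shows boolean indexing and .query(). The remaining three forms from the notes (.loc[], isin(), between()) might look like the sketch below, run on a small hypothetical frame that only borrows the column names of the flower dataset:

```python
import pandas as pd

# small hypothetical frame with the same column names as the flower dataset
flowers = pd.DataFrame({
    'species': ['rose', 'hibiscus', 'rose', 'shoeblack plant'],
    'size': ['medium', 'large', 'small', 'medium'],
    'height_cm': [48.55, 184.0, 30.2, 102.93],
})

# .loc with a boolean mask; combine conditions with & or | and parenthesize each one
medium_tall = flowers.loc[(flowers['size'] == 'medium') & (flowers['height_cm'] > 100)]

# isin() expects a list-like of allowed values
roses_or_hibiscus = flowers[flowers['species'].isin(['rose', 'hibiscus'])]

# between() includes both endpoints by default
mid_height = flowers[flowers['height_cm'].between(40, 110)]

print(medium_tall)
print(roses_or_hibiscus)
print(mid_height)
```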

### resizing: adding/dropping columns
* assigning a list as a new column: frame[label] = values, where the list has one entry per row
* frame.insert(loc, column, value), where loc is the insertion index, column is the column label, and value holds the column values; modifies the frame in place
* frame.assign(**kwargs): returns a NEW dataframe with updated/added columns. Useful when the new column repeats the same value across large parts of the frame, or when many values are being replaced with a single value. Also works with lambda functions
* pd.concat(objs, axis, join, ...), where objs is a list of the dataframes to combine, axis is the axis to concatenate along (0 for rows, 1 for columns), and join controls how the other axis is aligned ('outer' or 'inner')
* frame.drop(labels, axis=1) returns a new dataframe without the specified column(s)

In [57]:
# adding a new column: with a list
colorlist = []
colors = ['red','yellow','pink','white']
for i in range(len(flowerset)):
    if flowerset.loc[i, 'species'] == 'rose':
        colorlist.append(np.random.choice(colors))
    else:
        colorlist.append("-")
flowerset['color'] = colorlist

filter2 = flowerset.query('size == "medium" and fragrance == "mild" and height_cm < 31')
print(filter2)


     species    size fragrance  height_cm   color
2993    rose  medium      mild      30.85    pink
3784    rose  medium      mild      30.70  yellow
4716    rose  medium      mild      30.53     red
5875    rose  medium      mild      30.54  yellow
6925    rose  medium      mild      30.23    pink
7070    rose  medium      mild      30.85   white
7130    rose  medium      mild      30.27   white
8221    rose  medium      mild      30.94  yellow
8448    rose  medium      mild      30.36    pink
9587    rose  medium      mild      30.89   white
9625    rose  medium      mild      30.55   white
9831    rose  medium      mild      30.55   white
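
The cell above only uses list assignment. A rough sketch of insert(), assign(), pd.concat() and drop() follows; the frame and the names with_m, extra, stacked and no_size are hypothetical and chosen just for this illustration:

```python
import pandas as pd

# hypothetical two-row frame, not the flower dataset itself
frame = pd.DataFrame({'species': ['rose', 'hibiscus'],
                      'height_cm': [48.55, 184.0]})

# insert() modifies the frame in place, placing the new column at position 1
frame.insert(1, 'size', ['medium', 'large'])

# assign() returns a new frame; the lambda receives the frame being assigned to
with_m = frame.assign(height_m=lambda df: df['height_cm'] / 100)

# concat along axis=0 stacks rows; ignore_index=True renumbers them from 0
extra = pd.DataFrame({'species': ['rose'], 'size': ['small'], 'height_cm': [30.2]})
stacked = pd.concat([frame, extra], axis=0, ignore_index=True)

# drop() returns a new frame; stacked itself keeps its size column
no_size = stacked.drop('size', axis=1)

print(with_m)
print(no_size)
```

Only insert() changes the original frame here. The results of assign(), concat() and drop() have to be stored in a variable to be kept.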


### slicing 
* rows by position: frame[1:4] returns the rows at positions 1 through 3
* rows and columns: frame.loc[] slices by label and includes the end label; frame.iloc[] slices by integer position and excludes the end position

In [52]:
sliced1 = flowerset.loc[1:100, ['species','color']]
print(sliced1)

sliced2 = flowerset.iloc[4:16, 0:2]
print(sliced2)

             species   color
1    shoeblack plant    pink
2    shoeblack plant   white
3           hibiscus     red
4    shoeblack plant  purple
5           hibiscus  orange
..               ...     ...
96              rose    pink
97              rose    blue
98   shoeblack plant  purple
99          hibiscus    pink
100             rose   white

[100 rows x 2 columns]
            species    size
4   shoeblack plant   large
5          hibiscus   large
6              rose   small
7              rose   small
8          hibiscus   large
9   shoeblack plant   large
10             rose   small
11             rose   small
12         hibiscus   large
13         hibiscus   large
14             rose   small
15  shoeblack plant  medium
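
The row counts in the output above (100 rows for .loc[1:100], 12 rows for .iloc[4:16]) come from how each accessor treats the end of the slice. A minimal sketch on a hypothetical ten-row frame makes the difference easier to see:

```python
import pandas as pd

# hypothetical frame with the default 0..9 integer index
frame = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

by_label = frame.loc[1:4, ['a']]      # labels 1, 2, 3, 4: the end label is included
by_position = frame.iloc[1:4, 0:1]    # positions 1, 2, 3: the end position is excluded
plain = frame[1:4]                    # plain row slicing is positional as well

print(len(by_label), len(by_position), len(plain))
```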
