## Note: Pandas & Series

It is important to remember that pandas is built around the 'series': a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. Each column of a dataframe is a series.

Therefore, some of the methods and attributes mentioned here apply to a 'series' (a single dataframe column) and some apply to the dataframe itself.

Here is a list of what is available:
https://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats
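As a minimal sketch of the Series/DataFrame distinction, the snippet below builds a small made-up DataFrame (the column names and values are illustrative only) and shows that selecting one column gives back a Series that shares the row index, and that some accessors exist on a Series but not on the DataFrame.

```python
import pandas as pd

# A tiny made-up DataFrame (values are illustrative only)
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

col = df["a"]                      # selecting one column returns a Series
print(type(df).__name__)           # DataFrame
print(type(col).__name__)          # Series

# The Series shares the DataFrame's row index
print(col.index.equals(df.index))  # True

# The .str accessor lives on a Series, not on the DataFrame
print(df["b"].str.upper().tolist())  # ['X', 'Y', 'Z']
print(hasattr(df, "str"))            # False
```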

## 0. Import packages & Data

In [10]:
#First import pandas library
import pandas as pd
#Also import numpy as useful later
import numpy as np

In [11]:
import os

In [7]:
#For displaying tables nicer (side by side etc)
from IPython.display import display_html 

In [12]:
#Build the path with os.path.join so it is portable across operating systems
data_dir = os.path.join(os.getcwd(), "datasets")

Some useful parameters for read_csv could be:

* usecols = which columns to take
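A small sketch of usecols together with index_col follows. It reads from an in-memory string rather than a file, so the CSV text and its column names are hypothetical stand-ins for a real dataset.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = (
    "code,country,cars_per_cap,drives_right\n"
    "US,United States,809,True\n"
    "AUS,Australia,731,False\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["code", "country", "cars_per_cap"],  # keep only these columns
    index_col="code",                             # use 'code' as the row labels
)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['country', 'cars_per_cap']
print(list(df.index))    # ['US', 'AUS']
```

Restricting the columns at read time avoids loading data that is never used, which can matter for wide files.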

In [14]:
# Note that we could set the index_col in the read in

# Read in the wine data
# https://archive.ics.uci.edu/ml/datasets/wine

wine = pd.read_csv(data_dir + "/wine-total.csv")
wine.shape

(6497, 14)

## 1. Create a DataFrame

### 1.1 List to Dict to DF

In [4]:
#Here is an interesting example where we set the row_labels 
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Named cars_dict to avoid shadowing the built-in dict
cars_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}
cars = pd.DataFrame(cars_dict)
print(cars)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


### 1.2 Lists straight to df, setting columns

In [5]:
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
headers = ['country', 'drives_right', 'cars_per_cap']
combined_list = list(zip(names, dr,cpc))
cars2 = pd.DataFrame(combined_list, columns=headers)
print(cars2)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


### 1.3 Setting row labels

In [6]:
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index = row_labels
# Print cars again
print(cars)
#Note that passing index_col=0 to read_csv would set the row labels on import in the same way, provided the labels are the first column of the file

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JAP          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45


### 1.4 Attributes vs Methods

An important distinction is between attributes and methods.

* Attributes are accessed with a dot (df.attribute) and DO NOT take brackets.
* Methods call a function and hence DO require brackets (df.method()).

See below the difference between .shape and .describe()
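The difference can be sketched on a small made-up DataFrame (the column names x and y are illustrative only): .shape is read directly, while .describe() has to be called.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

print(df.shape)                  # attribute, no brackets -> (4, 2)
summary = df.describe()          # method, brackets call it and compute a result
print(summary.loc["mean", "x"])  # 2.5

# Without brackets a method is only a reference to the function; nothing is computed
print(callable(df.describe))     # True
print(callable(df.shape))        # False
```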

In [9]:
# Note that we can find all the available attributes and methods of an object using dir
len(dir(cars2))
# As we can see there are many available

453

### 1.5 Type Conversions

#### 1.5.1 Numeric

In [10]:
#See how we have some strings here?
s = pd.Series(['1.0', '2', -3])

#Let us do a default conversion
num1 = pd.to_numeric(s)

#There is an option to 'downcast' 
num2 = pd.to_numeric(s, downcast='float')
num3 = pd.to_numeric(s, downcast='integer')

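#We can ask it to ignore errors when a string cannot be cast to a number; the input then comes back unchanged
#(errors='ignore' is deprecated in recent pandas releases, so this may warn or fail depending on the version)
s2 = pd.Series(['apple', '1.0', '2', -3])
num4 =  pd.to_numeric(s2, errors='ignore')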

#Or we can force a conversion which will result in NaN
num5 = pd.to_numeric(s2, errors='coerce')

In [11]:
print(num1, "\n", num2, "\n", num3, "\n", num4, "\n", num5)

0    1.0
1    2.0
2   -3.0
dtype: float64 
 0    1.0
1    2.0
2   -3.0
dtype: float32 
 0    1
1    2
2   -3
dtype: int8 
 0    apple
1      1.0
2        2
3       -3
dtype: object 
 0    NaN
1    1.0
2    2.0
3   -3.0
dtype: float64


#### 1.5.2 Datetime

The to_datetime function has many configuration options; the most commonly used is the format argument, which takes a strftime-style pattern to set the exact date format.

Note that, from experience, this conversion can be quite computationally expensive.
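As a hedged sketch of the format argument, the example below parses made-up day/month/year strings with an explicit strftime-style pattern. The sample values are assumptions chosen so that day and month would be ambiguous without the pattern.

```python
import pandas as pd

# Made-up day/month/year strings
raw = pd.Series(["04/01/2015", "14/01/2015", "24/01/2015"])

# An explicit strftime-style pattern removes ambiguity (day comes first here)
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed.dt.month.tolist())  # [1, 1, 1]
print(parsed.dt.day.tolist())    # [4, 14, 24]

# errors="coerce" turns unparseable entries into NaT instead of raising
bad = pd.to_datetime(
    pd.Series(["04/01/2015", "not a date"]), format="%d/%m/%Y", errors="coerce"
)
print(bad.isna().tolist())       # [False, True]
```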

In [12]:
cars["dates"] = ['2015-01-04','2015-02-04','2017-01-04',
                 '2015-01-14','2015-01-24','2015-01-10',
                 '2016-01-04']
cars.info() #Notice how this is an object?

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, US to EG
Data columns (total 4 columns):
country         7 non-null object
drives_right    7 non-null bool
cars_per_cap    7 non-null int64
dates           7 non-null object
dtypes: bool(1), int64(1), object(2)
memory usage: 231.0+ bytes


In [None]:
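#Note: infer_datetime_format is deprecated from pandas 2.0, where a consistent format is inferred by default
cars["dates"] = pd.to_datetime(cars["dates"], infer_datetime_format=True)
cars.info() #Now we have a datetime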

In [None]:
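#Take the first value of the dates column (it is the 4th column, so position 4 would be out of bounds)
sample = cars["dates"].iloc[0]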
print(sample,"\n", sample.year, sample.month, sample.day)
#Here we can see it has worked correctly

In [None]:
#Therefore we could create a new column out of the dateparts
cars["month"] = cars["dates"].dt.month  #the .dt accessor exposes the date parts of a datetime series
cars

In [None]:
#We can also add time to datetime objects now
cars["new_dates"] = (cars["dates"] + pd.Timedelta(weeks=52))
cars

#### 1.5.3 Other (astype)

It is also possible to simply use the df.astype() method. This takes the following parameters:

* dtype
    * object, int64, float64, datetime64, bool
* copy
    * Whether to return a copy. Default is True.

A good example is here http://pbpython.com/pandas_dtypes.html

Note there is also a 'category' data type that has some benefits for memory optimisation. It is best used when the number of categories is **fixed** and **finite**. It is a form of *dynamic enumeration* and hence can have speed increases over using objects. There is also the possibility to order categorical variables.

The pandas documentation notes the following.

The categorical data type is useful in the following cases:

* A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
* The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
* As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
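The first two cases can be sketched as follows. The series is made up for illustration, and the memory comparison only checks which representation is smaller rather than quoting exact byte counts, since those depend on the pandas version.

```python
import pandas as pd

# A string column with only three distinct values, repeated many times
sizes = pd.Series(["one", "three", "two", "one", "three"] * 1000)

# Plain conversion: values are stored as integer codes plus a small lookup table
as_cat = sizes.astype("category")
print(as_cat.memory_usage(deep=True) < sizes.memory_usage(deep=True))  # True

# Ordered categorical: sorting and min/max follow the logical order
order = pd.CategoricalDtype(categories=["one", "two", "three"], ordered=True)
ordered = sizes.astype(order)
print(sizes.max())    # two   (lexical order)
print(ordered.max())  # three (logical order)
```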

## 2. Summary Stats + Quick Checks

Note that these checks have largely been superseded by the pandas_profiling package.

### 2.1 Shape, Describe, Info

In [None]:
#check some attributes like the shape
print(wine.shape)
#And some summary statistics
wine.describe()

In [None]:
#We could also natively check the row numbers by using len
len(wine)

In [None]:
#The info method lets us check that each column has the right type
wine.info()

In [None]:
#Note that we can ask for the dtype attribute on any particular series as well
wine.density.dtype

In [None]:
#Or we can call dtypes on a df
wine.dtypes

### 2.2 Head and Tail

These are useful for getting a quick glimpse of our df or series

In [None]:
print(wine.head(3).iloc[:,0:3])
print(wine.tail(3).iloc[:,0:3])
#Note that this returns a df when called on a df so we can slice the result from our call

### 2.3 Using 'in' for checking

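A simple way to check whether a value or category is present in a series (df column) is the 'in' operator. Note that 'in' applied directly to a series tests the index labels rather than the values, so we test against .values instead.

In [None]:
5.9 in wine["fixed acidity"].values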
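One caveat worth verifying: `in` applied directly to a Series tests the index labels, not the values. A minimal sketch on a made-up series shows this, alongside two ways of testing the values themselves.

```python
import pandas as pd

s = pd.Series([5.9, 7.0, 8.1])  # default index is 0, 1, 2

print(5.9 in s)             # False: 5.9 is not an index label
print(1 in s)               # True: 1 is an index label
print(5.9 in s.values)      # True: membership test on the underlying values
print(s.isin([5.9]).any())  # True: vectorised equivalent
```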

### 2.4 Value Counts

This is a brief note on the series method for frequency counts of a categorical variable. It is useful for quickly checking class balance and the presence of nulls. The more advanced **crosstab** function is found in the reshaping section.

In [None]:
#To explore any given categorical variable we can do the value counts. Usually a good idea to have the dropna=False
wine["red_white"].value_counts(dropna=False)
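A short sketch of how dropna and normalize change the result, using a made-up series with one missing value (the labels are illustrative, not taken from the wine data):

```python
import numpy as np
import pandas as pd

colour = pd.Series(["red", "white", "white", np.nan, "white"])

# Default: missing values are left out of the counts
print(colour.value_counts().to_dict())  # {'white': 3, 'red': 1}

# dropna=False adds a row for the missing values, so the counts sum to the row total
counts = colour.value_counts(dropna=False)
print(counts.sum())                     # 5

# normalize=True returns proportions of the non-missing rows
print(colour.value_counts(normalize=True)["white"])  # 0.75
```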

### 2.5 Unique checks

A nice little function to check the unique values in a column is simply unique. The documentation notes 'Significantly faster than numpy.unique. Includes NA values.'

In [None]:
pd.unique(wine.quality_binary)

In [None]:
#This can also be done as a series method
wine.quality_binary.unique()

### 2.6 Useful Math Methods

#### 2.6.1 Correlation matrix

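Pandas has a built-in correlation matrix method that can be called directly on a df.

In [None]:
#Restrict to numeric columns first; recent pandas versions raise an error if text columns are passed to corr
wine.select_dtypes("number").corr()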

#### 2.6.2 Min, Max, Median etc

In [8]:
#We can isolate the min, max, median etc by selecting the series object and applying the method directly
print(wine["fixed acidity"].min())
print(wine["fixed acidity"].max())

#there is also simple descriptive stats
print(wine["fixed acidity"].mean())
print(wine["fixed acidity"].median())
print(wine["fixed acidity"].var()) #variance
print(wine["fixed acidity"].std()) #standard deviation


3.8
15.9
7.215307064799138
7.0
1.6807404883629504
1.2964337577998155


#### 2.6.3 Other useful math

Some useful math operators would be:
* Series.add
  * Addition of series and other, element-wise (binary operator add).
* Series.sub
  * Subtraction of series and other, element-wise (binary operator sub).
* Series.mul
  * Multiplication of series and other, element-wise (binary operator mul).
* Series.div
  * Floating division of series and other, element-wise (binary operator truediv).
* Series.mod
  * Modulo of series and other, element-wise (binary operator mod).
* Series.pow
  * Exponential power of series and other, element-wise (binary operator pow).
* Series.round
  * Round each value in a Series to the given number of decimals.
* Series.lt
  * Less than of series and other, element-wise (binary operator lt).
* Series.gt
  * Greater than of series and other, element-wise (binary operator gt).
* Series.le
  * Less than or equal to of series and other, element-wise (binary operator le).
* Series.ge
  * Greater than or equal to of series and other, element-wise (binary operator ge).
* Series.ne
  * Not equal to of series and other, element-wise (binary operator ne).
* Series.eq
  * Equal to of series and other, element-wise (binary operator eq).
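A few of the methods listed above can be sketched on two made-up series. The main practical difference from the plain operators is that the method forms accept extra arguments such as fill_value, which matters when the indexes do not line up.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["x", "y"])  # no 'z' label

# The + operator aligns on the index; 'z' has no partner so the result is NaN
print((a + b).isna().tolist())          # [False, False, True]

# .add with fill_value treats the missing partner as 0 instead
print(a.add(b, fill_value=0).tolist())  # [11.0, 22.0, 3.0]

# Comparison methods return a boolean series
print(a.gt(1.5).tolist())               # [False, True, True]

# Methods chain naturally
print(a.div(3).round(2).tolist())       # [0.33, 0.67, 1.0]
```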

## 3. Strings

### 3.1 Regex extraction

In [57]:
cars_reg = cars.copy()

In [58]:
#str.contains does a regex match on each string and returns a boolean mask
cars_reg["country"].str.contains('ia')

0    False
1     True
2    False
3     True
4     True
5    False
6    False
Name: country, dtype: bool

In [59]:
#Build the mask from the same df that is being filtered
cars_reg = cars_reg[cars_reg["country"].str.contains('ia')]
cars_reg

     country  drives_right  cars_per_cap
1  Australia         False           731
3      India         False            18
4     Russia          True           200
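The cells above filter rows with str.contains. To pull the matched text itself out of each string, str.extract with a capture group is one option. The sketch below uses a made-up series of country names and patterns chosen only for illustration.

```python
import pandas as pd

countries = pd.Series(["United States", "Australia", "Japan", "India", "Russia"])

# contains() answers yes/no per row; the pattern is treated as a regex by default
print(countries.str.contains(r"ia$").tolist())
# [False, True, False, True, True]

# extract() returns the text captured by the group (NaN where there is no match)
first_word = countries.str.extract(r"^(\w+)", expand=False)
print(first_word.tolist())
# ['United', 'Australia', 'Japan', 'India', 'Russia']
```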


### 3.2 Editing strings

Some other useful functions to edit strings would be:
* Series.str.capitalize()
    * Convert strings in the Series/Index to be capitalized.
* Series.str.count(pat[, flags])
    * Count occurrences of pattern in each string of the Series/Index.
* Series.str.join(sep)
    * Join lists contained as elements in the Series/Index with passed delimiter.
* Series.str.len()
    * Compute length of each string in the Series/Index.
* Series.str.lower()
    * Convert strings in the Series/Index to lowercase.
* Series.str.repeat(repeats)
    * Duplicate each string in the Series/Index by indicated number of times.
* Series.str.replace(pat, repl[, n, case, …])
    * Replace occurrences of pattern/regex in the Series/Index with some other string.
* Series.str.slice([start, stop, step])
    * Slice substrings from each element in the Series/Index
* Series.str.slice_replace([start, stop, repl])
    * Replace a positional slice of a string with another value.
* Series.str.split([pat, n, expand])
    * Split strings around given separator/delimiter.
* Series.str.strip([to_strip])
    * Strip whitespace (including newlines) from each string in the Series/Index from left and right sides.
* Series.str.title()
    * Convert strings in the Series/Index to titlecase.
* Series.str.translate(table)
    * Map all characters in the string through the given mapping table.
* Series.str.upper()
    * Convert strings in the Series/Index to uppercase.
* Series.str.isalnum()
    * Check whether all characters in each string in the Series/Index are alphanumeric.
    * There are a variety of other similar checks, e.g. isalpha(), isdigit() and isnumeric().
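Several of the methods listed above can be chained to clean a text column. The sketch below uses made-up, deliberately messy strings; the cleaning steps are only one possible ordering.

```python
import pandas as pd

raw = pd.Series(["  united states ", "AUSTRALIA", "new-zealand"])

# strip whitespace, lowercase, swap the hyphen for a space, then title-case
clean = (
    raw.str.strip()
       .str.lower()
       .str.replace("-", " ", regex=False)  # regex=False: treat "-" as a literal
       .str.title()
)
print(clean.tolist())                     # ['United States', 'Australia', 'New Zealand']

print(clean.str.len().tolist())           # [13, 9, 11]
print(clean.str.split(" ").str[0].tolist())  # ['United', 'Australia', 'New']
print(clean.str.isalnum().tolist())       # [False, True, False] (spaces are not alphanumeric)
```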