# Python Data Science tools
- Author: Christopher Harrison, Susan Ibach  
- [Github](https://github.com/microsoft/c9-python-getting-started): sample code and slides  
- [Website](https://aka.ms/pythonbeginnerseries): video course  
- [Bilibili](https://www.bilibili.com/video/BV1qa4y1Y7CD): video course in Chinese 
- Learning Website: [https://channel9.msdn.com/](https://channel9.msdn.com/) 

---

## 0. Intro
Who are you?
- Done a little bit of Python
- Looking to explore data science
- Trying to understand the code in a data science tutorial or quick start

What are we doing here?
- Introduce Python data libraries and objects
- Walk through commonly used tools and techniques
- Navigate through some common scenarios

## 1. Jupyter Notebooks

Jupyter Notebook is an open source web application that lets you create and share documents containing live Python code.
- Open source
- Interactive development environment
- Download JupyterLab from [jupyter.org](https://jupyter.org)

### 1.1 Documentation

- [Jupyter](https://jupyter.org/) to install Jupyter so you can run Jupyter Notebooks locally on your computer
- [Jupyter Notebook viewer](https://nbviewer.jupyter.org/) to view Jupyter Notebooks in this GitHub repository without installing Jupyter
- [Azure Notebooks](https://notebooks.azure.com/) to create a free Azure Notebooks account to run Notebooks in the cloud
- [Create and run a notebook](https://docs.microsoft.com/azure/notebooks/tutorial-create-run-jupyter-notebook?WT.mc_id=python-c9-niner) is a tutorial that walks you through the process of using Azure Notebooks to create a complete Jupyter Notebook that demonstrates linear regression
- [How to create and clone projects](https://docs.microsoft.com/azure/notebooks/create-clone-jupyter-notebooks?WT.mc_id=python-c9-niner) to create a project
- [Manage and configure projects in Azure Notebooks](https://docs.microsoft.com/azure/notebooks/configure-manage-azure-notebooks-projects?WT.mc_id=python-c9-niner) to upload Notebooks to your project

### 1.2 Microsoft Learn Resources

Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).

- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

### 1.3 Usage
**Keyboard Shortcut**:
- Enter: enter edit mode
- Shift-Enter: run cell, select below
- Ctrl-Enter: run cell
- Alt-Enter: run cell, insert below
- Y: to code
- M: to markdown
- R: to raw
- A/B: insert cell above/below
- X: cut selected cell
- C: copy selected cell
- Shift-V: paste cell above
- V: paste cell below
- D,D: delete selected cell
- Shift-M: merge cell below

If the last line of code in a cell is the name of a variable, Jupyter will display the contents of the variable. You can use this feature to preview an object.

`In [ ]`:
- Blank: the cell has not been run yet
- `*`: the cell is executing
- Number: the order in which the cell was executed


## 2. Anaconda and Conda

### 2.1 Anaconda

[Anaconda](https://www.anaconda.com/) is an open source distribution of Python and R for data science. It includes:
- more than 1500 packages
- a graphical interface called Anaconda Navigator
- a command line interface called Anaconda prompt 
- a tool called Conda.

### 2.2 Conda

Python code often relies on external libraries stored in packages. Conda is an open source package management system and environment management system. Conda helps you manage environments and install packages for Jupyter Notebooks.

#### 2.2.1 Create a virtual environment with conda
`conda create --name my_notebook_env python=3.7 -y`:

- `--name my_notebook_env`: the name of the environment
- `python=3.7`: the version of Python in the environment
- `-y`: answers yes automatically to prompts

#### 2.2.2 Activate your virtual environment
- `conda activate my_notebook_env`: activate a virtual environment
- `conda deactivate`: deactivate the current active environment
- `conda remove --name my_notebook_env --all`: permanently delete the virtual environment

#### 2.2.3 Install library in the active environment
```
# Install pandas
conda install pandas

# Install Jupyter
conda install jupyter

# Return a list of installed packages
conda list
```

#### 2.2.4 Launch Jupyter Notebook
`jupyter notebook`: launch Jupyter Notebook

### 2.3 Documentation

- [Conda home page](https://docs.conda.io/)
- [Managing Conda environments](https://docs.conda.io/projects/conda/latest/user-guide/tasks/manage-environments.html) to find links and instructions for creating Conda environments, activating, and de-activating Conda environments 
- [Managing packages](https://docs.conda.io/projects/conda/latest/user-guide/getting-started.html#managing-packages) to learn how to install packages in a Conda environment
- [Conda cheat sheet](https://docs.conda.io/projects/conda/latest/user-guide/cheatsheet.html) is a handy quick reference of common Conda commands

## 3. Pandas

[pandas](https://pandas.pydata.org) is an open source, BSD-licensed Python library that contains a number of high-performance data structures and tools for data analysis.


**Documentation**: 

- [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) stores one dimensional arrays
- [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) stores two dimensional arrays and can contain different datatypes


**Microsoft Learn Resources**:

Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).

- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

In [None]:
import pandas as pd

### 3.1 Series

A pandas Series is similar to a Python list. `pd.Series` converts a list to a pandas Series.

- One-dimensional array of objects
- Each row has an index
- By default, the index is zero-based
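
The index behavior listed above can be sketched as follows. This is a minimal illustration; the `SEA` and `IAD` labels are made up for the example and are not part of the course data.

```python
import pandas as pd

# Default: pandas assigns a zero-based integer index
default_index = pd.Series(['Seattle-Tacoma', 'Dulles'])

# Optional: supply your own labels with the index argument
labelled = pd.Series(['Seattle-Tacoma', 'Dulles'], index=['SEA', 'IAD'])

print(default_index.index.tolist())  # the automatically assigned positions
print(labelled['IAD'])               # look a value up by its label
```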

In [None]:
# Create a series
airports = pd.Series([
             'Seattle-Tacoma',
             'Dulles',
             'London Heathrow',
             'Schiphol'
           ])

# indexing
airports[2] # shown with single quotes, because the notebook displays the value's representation

# iterate through all the values
for value in airports:
    print(value) # shown without quotes, because print() writes the plain string
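
The difference in quoting can be reproduced outside a notebook. As a rough sketch, for a plain string the notebook's cell output matches what `repr()` returns, while `print()` writes what `str()` returns:

```python
value = 'London Heathrow'

as_notebook_output = repr(value)  # what Jupyter shows for a bare expression
as_print_output = str(value)      # what print() writes

print(as_notebook_output)
print(as_print_output)
```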


### 3.2 DataFrame
A pandas DataFrame is similar to a table in a database or spreadsheet:
- Columns identified by name
- Rows of data
- Index column

In [None]:
# Create a DataFrame from a list of rows
# Without column names, the columns are labelled 0, 1, 2 by default
airports = pd.DataFrame([
   ['Seattle-Tacoma', 'Seattle', 'USA'],
   ['Dulles', 'Washington', 'USA'],
   ['London Heathrow', 'London', 'United Kingdom'],
   ['Schiphol', 'Amsterdam', 'Netherlands']
])
airports

# Pass columns= to name the columns; the later examples rely on these names
airports = pd.DataFrame([
    # 0                  1         2
   ['Seattle-Tacoma', 'Seattle', 'USA'],            # row 0
   ['Dulles', 'Washington', 'USA'],                 # row 1
   ['London Heathrow', 'London', 'United Kingdom'], # row 2
   ['Schiphol', 'Amsterdam', 'Netherlands']],       # row 3
   columns=['Name', 'City', 'Country']
 )
airports

#### 3.2.1 Explore DataFrame

**Common functions** to explore contents and structure of a DataFrame:

- [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) returns the first *n* rows from the DataFrame
- [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) returns the last *n* rows from the DataFrame
- [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns the dimensions of the DataFrame (e.g. number of rows and columns)
- [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) provides a summary of the DataFrame content including column names, their datatypes, and number of rows containing non-null values

In [None]:
# Explore contents and structure of a DataFrame
airports.head(3)
airports.tail(3)
airports.shape # (rows, columns): here 4 rows, 3 columns
airports.info() # strings are stored as datatype object
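
A self-contained sketch of the exploration functions, rebuilding the four-row `airports` DataFrame so it can be run on its own. Note that `shape` is an attribute, so it takes no parentheses.

```python
import pandas as pd

airports = pd.DataFrame(
    [['Seattle-Tacoma', 'Seattle', 'USA'],
     ['Dulles', 'Washington', 'USA'],
     ['London Heathrow', 'London', 'United Kingdom'],
     ['Schiphol', 'Amsterdam', 'Netherlands']],
    columns=['Name', 'City', 'Country'])

first_two = airports.head(2)      # rows 0 and 1
last_one = airports.tail(1)       # row 3 only
n_rows, n_cols = airports.shape   # attribute, not a method call

print(first_two)
print(last_one)
print(n_rows, n_cols)
```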

#### 3.2.2 Query a DataFrame

- [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html): `loc[rows, columns]` returns specific rows and columns by specifying index labels and column names
- [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html): `iloc[rows, columns]` returns specific rows and columns by specifying row and column positions

In [None]:
# Column
airports['City'] # Request a single column
airports[['Name', 'Country']] # Request a list of columns

# Rows and columns (by position)
airports.iloc[0,0] # data of row 0 and column 0
airports.iloc[:,:] # all rows and all columns
airports.iloc[0:2,:] # first 2 rows and all columns
airports.iloc[:,[0,2]] # all rows and column 0 and column 2 (Name and Country)

# Rows and columns (by name)
airports.loc[:,['Name', 'Country']] # all rows and the columns Name and Country
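
One difference worth seeing in code: a slice in `iloc` excludes its end position, while a slice in `loc` includes its end label. The sketch below also shows `loc` with a condition, which the cells above do not cover; treat it as an illustrative extra.

```python
import pandas as pd

airports = pd.DataFrame(
    [['Seattle-Tacoma', 'Seattle', 'USA'],
     ['Dulles', 'Washington', 'USA'],
     ['London Heathrow', 'London', 'United Kingdom'],
     ['Schiphol', 'Amsterdam', 'Netherlands']],
    columns=['Name', 'City', 'Country'])

by_position = airports.iloc[0:2, :]  # positions 0 and 1 (end excluded)
by_label = airports.loc[0:2, :]      # labels 0, 1 and 2 (end included)

# Rows where Country is 'USA', keeping only the Name column
usa_names = airports.loc[airports['Country'] == 'USA', 'Name']

print(len(by_position), len(by_label))
print(usa_names.tolist())
```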

### 3.3 Read and write CSV file 

CSV (comma-separated values) files are frequently used to store data. To access the data in a CSV file from a Jupyter Notebook, the file must be somewhere the notebook can read it; with a hosted notebook service, upload the file first.

- [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a comma-separated values file into a DataFrame
- [to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) writes contents of a DataFrame to a comma-separated values file
- [NaN](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is the default representation of missing values

In [None]:
# read
airports_df = pd.read_csv('Data/airports.csv') # loads a DataFrame from a CSV
airports_df = pd.read_csv('Data/airportsInvalidRows.csv', 
                          on_bad_lines='skip') # skip rows with errors, e.g. an extra column (pandas >= 1.3; older versions use error_bad_lines=False)
airports_df = pd.read_csv('Data/airportsNoHeaderRows.csv',
                          header=None) # the first row does not contain column headers
airports_df = pd.read_csv('Data/airportsNoHeaderRows.csv',
                          header=None,
                          names=['Name','City','Country']) # Provide names for the columns
airports_df = pd.read_csv('Data/airportsBlankValues.csv') # missing values appear as NaN

# write
airports_df.to_csv('Data/MyNewCSVFile.csv', index=False) # index=False excludes the index column
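
The read and write options can be tried without any files on disk. This sketch uses an in-memory string in place of a CSV file; the two rows are illustrative data, not the course files.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file with no header row and one blank value
raw_csv = "Seattle-Tacoma,Seattle,USA\nSchiphol,,Netherlands\n"

airports_df = pd.read_csv(io.StringIO(raw_csv),
                          header=None,                        # first row is data, not headers
                          names=['Name', 'City', 'Country'])  # supply column names

# The blank City value is read as NaN
missing_city = airports_df['City'].isna().tolist()

# With no path argument, to_csv returns the CSV text;
# index=False leaves out the index column
csv_text = airports_df.to_csv(index=False)

print(missing_city)
print(csv_text)
```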

### 3.4 Data manipulation
#### 3.4.1 Remove a column
- [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) deletes specified columns from a DataFrame. Use the **inplace** parameter to specify you want to drop the column from the original DataFrame

In [None]:
# remove a column and create a new DataFrame; df itself is unchanged
new_df = df.drop(columns=['COL_NAME'])
# equivalent: new_df = df.drop(['COL_NAME'], axis=1)

# drop the column from the original DataFrame
df.drop(columns=['COL_NAME'], inplace=True)
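
A small sketch showing that `drop` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed. The column names here are hypothetical placeholders.

```python
import pandas as pd

df = pd.DataFrame({'DISTANCE': [100, 200], 'TAIL_NUM': ['N1', 'N2']})

new_df = df.drop(columns=['TAIL_NUM'])       # df still has both columns
same_result = df.drop(['TAIL_NUM'], axis=1)  # equivalent spelling

print(df.columns.tolist())
print(new_df.columns.tolist())

df.drop(columns=['TAIL_NUM'], inplace=True)  # now df itself is modified
print(df.columns.tolist())
```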

#### 3.4.2 Handling duplicates and rows with missing values

- [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) removes rows with missing values
- [duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) returns a True or False to indicate if a row is a duplicate of a previous row
- [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) returns a DataFrame with duplicate rows removed

**Missing values**: `NaN`

In [None]:
# check missing values
df.info()

# create a new DataFrame with no missing values; df itself is unchanged
no_nulls_df = df.dropna()

# Specify inplace=True to update the existing DataFrame
df.dropna(inplace=True)

# Check the number of rows and number of rows with non-null values to confirm
no_nulls_df.info()                 

**Duplicate rows**

In [None]:
# check duplicate rows
df.duplicated() # If a row is a duplicate of a previous row, it returns True

# drop duplicate rows
df.drop_duplicates(inplace=True)
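
To see what these functions return, here is a minimal sketch on a four-row DataFrame with one duplicate row and one missing value. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DISTANCE': [100.0, 100.0, 250.0, np.nan],
    'ARR_DELAY': [5.0, 5.0, -3.0, 12.0],
})

duplicate_flags = df.duplicated().tolist()  # True where a row repeats an earlier row
no_nulls_df = df.dropna()                   # drops the row containing NaN
unique_df = df.drop_duplicates()            # drops the repeated row

print(duplicate_flags)
print(len(df), len(no_nulls_df), len(unique_df))
```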

## 4. Modeling

[scikit-learn](https://scikit-learn.org/) is a library of tools for predictive data analysis, which will allow you to prepare your data for machine learning and create models.

### 4.1 Splitting test and training data

#### 4.1.1 Split data into features and labels

Create a DataFrame called X containing only the **features** we want to use to train our model. The features are the columns we think can help us predict how late a flight will arrive: `DISTANCE` and `CRS_ELAPSED_TIME`

In [None]:
X = df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]

Create a DataFrame called y containing only the value (**label**) we want to predict with our model. The label is the column we want to predict with our trained model: `ARR_DELAY`

In [None]:
y = df.loc[:,['ARR_DELAY']]
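
The brackets around the column name matter. As a small sketch on made-up values: passing a list of names to `loc` gives a DataFrame, while passing a single name gives a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'DISTANCE': [100, 200],
    'CRS_ELAPSED_TIME': [30, 55],
    'ARR_DELAY': [5, -2],
})

y_frame = df.loc[:, ['ARR_DELAY']]  # list of names -> DataFrame with one column
y_series = df.loc[:, 'ARR_DELAY']   # single name -> Series

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```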

#### 4.1.2 Split into test and training data

Use scikit-learn's `train_test_split` to move 30% of the rows into test DataFrames. The other 70% of the rows go into DataFrames we can use to train our model.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( # X and y are split on the same rows, so features and labels stay matched
                                                    X, 
                                                    y, 
                                                    test_size=0.3, 
                                                    random_state=42 # seed
                                                   )

- **X_train**: the features for 70% of the rows, used to train our model
- **X_test**: the features for the remaining 30% of the rows, used to test our trained model so we can check its accuracy
- **y_train**: the labels for the same 70% of the rows, used to train our model
- **y_test**: the labels for the remaining 30% of the rows, used to test our trained model so we can check its accuracy
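
The split can be checked on a small synthetic DataFrame. This sketch uses ten made-up rows and a single feature to confirm the 70/30 sizes and that features and labels stay matched by index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten illustrative rows
df = pd.DataFrame({
    'DISTANCE': list(range(100, 1100, 100)),
    'ARR_DELAY': list(range(10)),
})
X = df.loc[:, ['DISTANCE']]
y = df.loc[:, ['ARR_DELAY']]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))             # sizes of the two parts
print(X_train.index.equals(y_train.index))   # same rows in X_train and y_train
```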

### 4.2 Train a linear regression 
Use the scikit-learn **LinearRegression** `fit` method to train a linear regression model based on the training data stored in `X_train` and `y_train`.

- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) fits a linear model
- [LinearRegression.fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression.fit) is used to fit the linear model based on training data

In [None]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()     # Create a scikit learn LinearRegression object
regressor.fit(X_train, y_train)    # Use the fit method to train the model using your training data
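
To make the result of `fit` concrete, here is a sketch on synthetic data that follows `y = 2x + 1` exactly, so the learned slope and intercept are easy to recognize. The `DEP_DELAY` feature and its values are assumptions for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

X_train = pd.DataFrame({'DEP_DELAY': [0, 10, 20, 30]})
y_train = 2 * X_train['DEP_DELAY'] + 1   # a Series that lies exactly on a line

regressor = LinearRegression()
regressor.fit(X_train, y_train)

slope = regressor.coef_[0]         # one coefficient per feature
intercept = regressor.intercept_   # value predicted when the feature is 0

print(slope, intercept)
```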

### 4.3 Testing a model

Once a model is trained, it can be used to predict values. Passing it the test features, which it has not seen, and comparing the predictions with the known labels shows how well the model performs.

- [LinearRegression.predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression.predict) is used to predict outcomes for new data based on the trained linear model

In [None]:
y_pred = regressor.predict(X_test)
# Compare the predicted values to the actual values in y_test to see how well your model performs
y_pred
y_test
# y_pred is a NumPy array, while y_test is a pandas DataFrame
# type(y_pred)
# type(y_test)
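
Because the two objects have different types, one way to compare them is to place the predictions next to the actual values in a single DataFrame. This is a sketch on made-up numbers; the `PREDICTED` and `ERROR` column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative training data on the line y = 2x + 1
X_train = pd.DataFrame({'DEP_DELAY': [0, 10, 20, 30]})
y_train = pd.DataFrame({'ARR_DELAY': [1, 21, 41, 61]})
# Test rows keep their original row labels (7 and 9 here)
X_test = pd.DataFrame({'DEP_DELAY': [5, 15]}, index=[7, 9])
y_test = pd.DataFrame({'ARR_DELAY': [12, 30]}, index=[7, 9])

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)   # NumPy array of shape (2, 1)

comparison = y_test.copy()
comparison['PREDICTED'] = y_pred[:, 0]  # first (only) column of the array
comparison['ERROR'] = comparison['ARR_DELAY'] - comparison['PREDICTED']

print(type(y_pred).__name__, y_pred.shape)
print(comparison)
```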

### 4.4 Evaluating accuracy of a model

The Mean Squared Error (MSE) is the average of the squared differences between the actual and predicted values. The lower the MSE, the better the model.  

**MSE** = `mean((actual - predicted)**2)`

The Root Mean Squared Error (RMSE) is the square root of the mean squared error. It is expressed in the same units as the value being predicted.

**RMSE** = `sqrt(MSE)`

**R squared** is the proportion of variation in the outcome that is explained by the predictor variables. It is an indication of how much the values passed to the model influence the predicted value. The higher the R-squared, the better the model.

- [metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=metrics#module-sklearn.metrics) includes functions and metrics that can be used for data science including measuring accuracy of models
- [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) returns the mean squared error, a measure used to measure accuracy of linear regression models
- [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) returns the R^2 regression score, a measure used to measure accuracy of linear regression models
- [numpy.sqrt](https://numpy.org/doc/1.18/reference/generated/numpy.sqrt.html?highlight=sqrt#numpy.sqrt) returns the square root of a value

In [None]:
# MSE
from sklearn import metrics
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
# RMSE
import numpy as np
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# R^2
print('R^2: ',metrics.r2_score(y_test, y_pred))
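
The formulas above can be checked by hand on three made-up values and compared with the scikit-learn functions. The R squared line uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn import metrics

actual = np.array([3.0, 5.0, 7.0])
predicted = np.array([2.0, 5.0, 9.0])

# MSE: mean of the squared errors (1 + 0 + 4) / 3
mse = np.mean((actual - predicted) ** 2)
# RMSE: square root of the MSE
rmse = np.sqrt(mse)
# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, metrics.mean_squared_error(actual, predicted))
print(rmse)
print(r2, metrics.r2_score(actual, predicted))
```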

## 5. Numpy vs Pandas
[NumPy](https://numpy.org/) is a Python package for scientific computing that includes N-dimensional array objects for data analysis.

Common object:
- [array](https://numpy.org/doc/1.18/reference/generated/numpy.array.html?highlight=array#numpy.array) creates an N-dimensional array object


[pandas](https://pandas.pydata.org/) is a Python package for data analysis that includes one-dimensional and two-dimensional array objects.

Common objects:  

- [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) stores a one dimensional array
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) stores a two-dimensional array

**Difference**:

- different classes
- NumPy arrays are indexed implicitly by position; pandas objects carry an explicit index


In [None]:
# Numpy array vs Pandas series
import numpy as np
airports_array = np.array(['Pearson','Changi','Narita'])
print(airports_array)
print(airports_array[2])

airports_series = pd.Series(['Pearson','Changi','Narita'])
print(airports_series)
print(airports_series[2])

# two dimensional numpy array vs pandas DataFrame
airports_array = np.array([
  ['YYZ','Pearson'],
  ['SIN','Changi'],
  ['NRT','Narita']])
print(airports_array)
print(airports_array[0,0])

airports_df = pd.DataFrame([['YYZ','Pearson'],['SIN','Changi'],['NRT','Narita']])
print(airports_df)
print(airports_df.iloc[0,0])

# convert NumPy array to Pandas Dataframe
predicted_df = pd.DataFrame(y_pred)
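
The implicit versus explicit index difference shows up when filtering. In this sketch on illustrative values, the filtered NumPy array is renumbered from 0, while the filtered pandas Series keeps its original labels.

```python
import numpy as np
import pandas as pd

delays_array = np.array([10, 20, 30])
delays_series = pd.Series([10, 20, 30])

big_array = delays_array[delays_array > 10]      # positions restart at 0
big_series = delays_series[delays_series > 10]   # original labels 1 and 2 are kept

print(big_array[0])               # first remaining element, by position
print(big_series.index.tolist())  # labels that survived the filter
print(big_series.loc[1])          # same element, looked up by its label
```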

## 6. Visualizing data with Matplotlib

[Matplotlib](https://matplotlib.org/) gives you the ability to draw charts which can be used to visualize data.

- [pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html?highlight=pyplot#module-matplotlib.pyplot) provides the ability to draw plots similar to the MATLAB tool
- [pyplot.plot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) plots a graph
- [pyplot.show](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show) displays figures such as a graph
- [pyplot.scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html?highlight=scatter%20plot#matplotlib.pyplot.scatter) is used to draw scatter plots, a diagram that shows the relationship between two sets of data

### 6.1 Scatter plot: checking for correlation with visualization

A common plot in data science is the scatter plot, used to check the relationship between two columns. If you see dots scattered everywhere, there is no correlation between the two columns. If you see something resembling a line, there is a correlation between the two columns.

You can use the plot method of the DataFrame to draw the scatter plot:

kind - the type of graph to draw  
x - value to plot as x  
y - value to plot as y  
color - color to use for the graph points  
alpha - opacity - useful to show density of points in a scatter plot  
title - title of the graph  

In [None]:
import matplotlib.pyplot as plt

# Check if there is a relationship between the distance of a flight and how late the flight arrives
delays_df.plot(
               kind='scatter',
               x='DISTANCE',
               y='ARR_DELAY',
               color='blue',
               alpha=0.3, # transparency
               title='Correlation of arrival and distance'
              )
plt.show()
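
A visual check can be complemented with a number. As a sketch that goes slightly beyond the plots above, pandas `Series.corr` returns the Pearson correlation coefficient, where values near 1 or -1 indicate a strong linear relationship and values near 0 indicate none. The data here is made up.

```python
import pandas as pd

# Illustrative delays in minutes
delays_df = pd.DataFrame({
    'DEP_DELAY': [0, 10, 20, 30, 40],
    'ARR_DELAY': [2, 9, 23, 28, 41],
})

# Pearson correlation between the two columns
correlation = delays_df['DEP_DELAY'].corr(delays_df['ARR_DELAY'])

print(round(correlation, 3))
```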

In [None]:
# Check if there is a relationship between how late the flight leaves and how late the flight arrives
delays_df.plot(
               kind='scatter',
               x='DEP_DELAY',
               y='ARR_DELAY',
               color='blue',
               alpha=0.3,
               title='Correlation of arrival and departure delay'
              )
plt.show()

# 1. alternative: build the same plot step by step with pyplot
plt.xlabel('Departure delay (minutes)')
plt.ylabel('Arrival delay (minutes)')
plt.title('Correlation Departure and Arrival Delay')
plt.scatter(
            x=delays_df['DEP_DELAY'],
            y=delays_df['ARR_DELAY'],
            color='blue', 
            alpha=0.3
            )
plt.show()

# 2. draw a line to show the predicted values of a trained model
#    (assumes the model was trained on the single feature DEP_DELAY, so X_test has one column)
plt.xlabel('Departure delay (minutes)')
plt.ylabel('Arrival delay (minutes)')
plt.title('Predicted Arrival Delay')
plt.plot(
            X_test,
            y_pred,
            color='red', 
            linewidth=2
            )
plt.show()

# 3. overlay 2 on 1
plt.xlabel('Departure delay (minutes)')
plt.ylabel('Arrival delay (minutes)')
plt.scatter(x=delays_df['DEP_DELAY'], y=delays_df['ARR_DELAY'], color='blue', alpha=0.3)
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()
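
For reference, the overlay can be reproduced end to end on synthetic data. This is only a sketch under stated assumptions: the delay values are invented, the model is fitted and evaluated on the same five rows for brevity, and the `Agg` backend line is there so the code runs outside a notebook (omit it in Jupyter).

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative delays in minutes, already sorted by DEP_DELAY
delays_df = pd.DataFrame({
    'DEP_DELAY': [0, 10, 20, 30, 40],
    'ARR_DELAY': [2, 9, 23, 28, 41],
})
X = delays_df.loc[:, ['DEP_DELAY']]
y = delays_df.loc[:, ['ARR_DELAY']]

regressor = LinearRegression().fit(X, y)
y_pred = regressor.predict(X)   # array of shape (5, 1)

fig, ax = plt.subplots()
# Actual values as points
ax.scatter(delays_df['DEP_DELAY'], delays_df['ARR_DELAY'], color='blue', alpha=0.3)
# Predicted values as a line
ax.plot(X['DEP_DELAY'], y_pred[:, 0], color='red', linewidth=2)
ax.set_xlabel('Departure delay (minutes)')
ax.set_ylabel('Arrival delay (minutes)')
ax.set_title('Predicted Arrival Delay')
```

In an interactive session, `plt.show()` after the last line displays the figure.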
