# Python Data Science tools
- Author: Christopher Harrison, Susan Ibach  
- [Github](https://github.com/microsoft/c9-python-getting-started): sample code and slides  
- [Website](https://aka.ms/pythonbeginnerseries): video course  
- [Bilibili](https://www.bilibili.com/video/BV1qa4y1Y7CD): video course in Chinese 
- Learning Website: [https://channel9.msdn.com/](https://channel9.msdn.com/) 

---

## 0. Intro
Who are you?
- Done a little bit of Python
- Looking to explore data science
- Trying to understand the code in a data science tutorial or quick start

What are we doing here?
- Introduce Python data libraries and objects
- Walk through commonly used tools and techniques
- Navigate through some common scenarios

## 1. Jupyter Notebooks

Jupyter Notebook is an open source web application that allows you to create and share Python code.
- Open source
- Interactive development environment
- Download JupyterLab from [jupyter.org](https://jupyter.org)

### 1.1 Documentation

- [Jupyter](https://jupyter.org/) to install Jupyter so you can run Jupyter Notebooks locally on your computer
- [Jupyter Notebook viewer](https://nbviewer.jupyter.org/) to view Jupyter Notebooks in this GitHub repository without installing Jupyter
- [Azure Notebooks](https://notebooks.azure.com/) to create a free Azure Notebooks account to run Notebooks in the cloud
- [Create and run a notebook](https://docs.microsoft.com/azure/notebooks/tutorial-create-run-jupyter-notebook?WT.mc_id=python-c9-niner) is a tutorial that walks you through the process of using Azure Notebooks to create a complete Jupyter Notebook that demonstrates linear regression
- [How to create and clone projects](https://docs.microsoft.com/azure/notebooks/create-clone-jupyter-notebooks?WT.mc_id=python-c9-niner) to create a project
- [Manage and configure projects in Azure Notebooks](https://docs.microsoft.com/azure/notebooks/configure-manage-azure-notebooks-projects?WT.mc_id=python-c9-niner) to upload Notebooks to your project

### 1.2 Microsoft Learn Resources

Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).

- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

### 1.3 Usage
**Keyboard shortcuts**:
- Enter: enter edit mode
- Shift-Enter: run cell, select below
- Ctrl-Enter: run cell
- Alt-Enter: run cell, insert below
- Y: to code
- M: to markdown
- R: to raw
- A/B: insert cell above/below
- X: cut selected cell
- C: copy selected cell
- Shift-V: paste cell above
- V: paste cell below
- D,D: delete selected cell
- Shift-M: merge cell below

If the last line of code in a cell is the name of a variable, Jupyter will display the contents of the variable. You can use this feature to preview an object.

`In [ ]`:
- Blank: the cell has not been run yet
- `*`: the cell is currently executing
- Number: the order in which the cell was executed


## 2. Anaconda and Conda

### 2.1 Anaconda

[Anaconda](https://www.anaconda.com/) is an open source distribution of Python and R for data science. It includes:
- more than 1500 packages
- a graphical interface called Anaconda Navigator
- a command line interface called Anaconda Prompt
- a tool called Conda

### 2.2 Conda

Python code often relies on external libraries stored in packages. Conda is an open source package management system and environment management system. Conda helps you manage environments and install packages for Jupyter Notebooks.

#### 2.2.1 Create a virtual environment with conda
`conda create --name my_notebook_env python=3.7 -y`:

- `--name my_notebook_env`: the name of the environment
- `python=3.7`: version of Python in the environment
- `-y`: responds yes automatically to prompts

#### 2.2.2 Activate your virtual environment
- `conda activate my_notebook_env`: activate a virtual environment
- `conda deactivate`: deactivate the current active environment
- `conda remove --name my_notebook_env --all`: permanently delete the virtual environment

#### 2.2.3 Install library in the active environment
```
# Install pandas
conda install pandas

# Install Jupyter
conda install jupyter

# List the libraries installed in the active environment
conda list
```

#### 2.2.4 Launch Jupyter Notebook
`jupyter notebook`: launch Jupyter Notebook from the active environment
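
Once a notebook is running, it can be useful to confirm which environment its kernel uses. The following is a minimal sketch, not part of the course material. It assumes conda's usual behaviour of setting the `CONDA_DEFAULT_ENV` environment variable on activation; the fallback text is only a placeholder chosen for this example.

```python
import os
import sys

# Path of the Python interpreter running this code. Inside an activated
# conda environment it normally sits under that environment's folder.
print(sys.executable)

# conda sets CONDA_DEFAULT_ENV when an environment is activated. It may be
# missing if Python was started another way, so use a placeholder fallback.
env_name = os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment detected)")
print(env_name)
```

If the printed name is not `my_notebook_env`, the notebook was probably launched before the environment was activated.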

### 2.3 Documentation

- [Conda home page](https://docs.conda.io/)
- [Managing Conda environments](https://docs.conda.io/projects/conda/latest/user-guide/tasks/manage-environments.html) to find links and instructions for creating, activating, and deactivating Conda environments
- [Managing packages](https://docs.conda.io/projects/conda/latest/user-guide/getting-started.html#managing-packages) to learn how to install packages in a Conda environment
- [Conda cheat sheet](https://docs.conda.io/projects/conda/latest/user-guide/cheatsheet.html) is a handy quick reference of common Conda commands

## 3. Pandas

[pandas](https://pandas.pydata.org) is an open source, BSD-licensed Python library that provides high-performance data structures and tools for data analysis.


**Documentation**: 

- [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) stores one dimensional arrays
- [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) stores two dimensional arrays and can contain different datatypes


**Microsoft Learn Resources**:

Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).

- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

In [None]:
import pandas as pd

### 3.1 Series

A pandas Series is similar to a Python list. `pd.Series` converts a list to a pandas Series.

- One-dimensional array of objects
- Each row has an index
- By default, the index is zero-based

In [None]:
# Create a series
airports = pd.Series([
             'Seattle-Tacoma',
             'Dulles',
             'London Heathrow',
             'Schiphol'
           ])

# indexing
airports[2] # shown in quotes when it is the last line of a cell, because Jupyter displays the value's repr

# iterate through all the values
for value in airports:
    print(value) # shown without quotes, because print() writes the string itself
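
The zero-based default index can be seen directly, and it can be replaced with custom labels. This is a small sketch using the same airport names. The three-letter codes used as labels are an assumption added for illustration and are not part of the course data.

```python
import pandas as pd

names = ['Seattle-Tacoma', 'Dulles', 'London Heathrow', 'Schiphol']

# Default index: positions 0, 1, 2, 3
airports = pd.Series(names)
default_labels = list(airports.index)
print(default_labels)  # [0, 1, 2, 3]

# Positional access works whatever the index labels are
third = airports.iloc[2]
print(third)

# Custom index: labels supplied by the caller (illustrative airport codes)
coded = pd.Series(names, index=['SEA', 'IAD', 'LHR', 'AMS'])
heathrow = coded['LHR']
print(heathrow)
```

Using `iloc` for positions and plain brackets for labels keeps the two kinds of lookup distinct once the index is no longer the default.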


### 3.2 DataFrame
A pandas DataFrame is similar to a table in a database or spreadsheet:
- Columns identified by name
- Rows of data
- Index column

In [None]:
# Create a DataFrame without column names
airports = pd.DataFrame([
   ['Seattle-Tacoma', 'Seattle', 'USA'],
   ['Dulles', 'Washington', 'USA'],
   ['London Heathrow', 'London', 'United Kingdom'],
   ['Schiphol', 'Amsterdam', 'Netherlands']
])
airports # by default, columns are labelled by numbers: 0, 1, 2

# Create a DataFrame with named columns (used in the examples that follow)
airports = pd.DataFrame([
    # column 0          column 1   column 2
   ['Seattle-Tacoma', 'Seattle', 'USA'],            # row 0
   ['Dulles', 'Washington', 'USA'],                 # row 1
   ['London Heathrow', 'London', 'United Kingdom'], # row 2
   ['Schiphol', 'Amsterdam', 'Netherlands']],       # row 3
   columns=['Name', 'City', 'Country']
)
airports

#### 3.2.1 Explore DataFrame

**Common functions** to explore contents and structure of a DataFrame:

- [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) returns the first *n* rows from the DataFrame
- [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) returns the last *n* rows from the DataFrame
- [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) is an attribute that holds the dimensions of the DataFrame as a tuple (number of rows, number of columns)
- [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) provides a summary of the DataFrame content including column names, their datatypes, and number of rows containing non-null values

In [None]:
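# Explore contents and structure of a DataFrame
airports.head(3) # first 3 rows
airports.tail(3) # last 3 rows
airports.shape # (rows, columns): (4, 3) for the DataFrame created above
airports.info() # column names, datatypes and non-null counts; strings typically show as datatype object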
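
As a self-contained sketch of the exploration calls above, the example below rebuilds the four-row airports DataFrame from this section and checks what each call returns. The variable names are chosen for illustration only.

```python
import pandas as pd

airports = pd.DataFrame(
    [
        ['Seattle-Tacoma', 'Seattle', 'USA'],
        ['Dulles', 'Washington', 'USA'],
        ['London Heathrow', 'London', 'United Kingdom'],
        ['Schiphol', 'Amsterdam', 'Netherlands'],
    ],
    columns=['Name', 'City', 'Country'],
)

first_three = airports.head(3)   # rows with index 0, 1, 2
last_three = airports.tail(3)    # rows with index 1, 2, 3

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = airports.shape
print(n_rows, n_cols)  # 4 3

# info() prints its summary rather than returning a DataFrame
airports.info()
```

Note that `head` and `tail` return new DataFrames and keep the original index labels.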

#### 3.2.2 Query a DataFrame

- [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html): `loc[rows, columns]` returns specific rows and columns by specifying row labels and column names
- [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html): `iloc[rows, columns]` returns specific rows and columns by specifying integer positions

In [None]:
# Column
airports['City'] # Request a single column
airports[['Name', 'Country']] # Request a list of columns

# Rows and columns (by position)
airports.iloc[0,0] # data of row 0 and column 0
airports.iloc[:,:] # all rows and all columns
airports.iloc[0:2,:] # first 2 rows and all columns
airports.iloc[:,[0,2]] # all rows and column 0 and column 2 (Name and Country)

# Rows and columns (by name)
airports.loc[:,['Name', 'Country']] # all rows and the columns Name and Country
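
One difference between the two accessors is easy to miss: a slice in `iloc` excludes its end position, while a slice in `loc` includes its end label. The sketch below illustrates this on the airports DataFrame from this section, rebuilt here so the example runs on its own.

```python
import pandas as pd

airports = pd.DataFrame(
    [
        ['Seattle-Tacoma', 'Seattle', 'USA'],
        ['Dulles', 'Washington', 'USA'],
        ['London Heathrow', 'London', 'United Kingdom'],
        ['Schiphol', 'Amsterdam', 'Netherlands'],
    ],
    columns=['Name', 'City', 'Country'],
)

# iloc slices by position and stops before the end: positions 0 and 1
by_position = airports.iloc[0:2, :]

# loc slices by label and includes the end: labels 0, 1 and 2
by_label = airports.loc[0:2, :]

print(len(by_position), len(by_label))

# Selecting columns by name or by position gives the same result here
name_country = airports.loc[:, ['Name', 'Country']]
same_columns = airports.iloc[:, [0, 2]]
```

With the default zero-based index the row labels happen to equal the positions, which is why the same numbers appear in both calls.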

### 3.3 Read and write CSV files

CSV (comma-separated values) files are frequently used to store data. To read a CSV file from a Jupyter Notebook, the file must be somewhere the notebook can access. With a hosted notebook service, upload the file first.

- [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a comma-separated values file into a DataFrame
- [to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) writes contents of a DataFrame to a comma-separated values file
- [NaN](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is the default representation of missing values

In [None]:
# read
airports_df = pd.read_csv('Data/airports.csv') # loads a DataFrame from a CSV
airports_df = pd.read_csv('Data/airportsInvalidRows.csv',
                          on_bad_lines='skip') # skip rows with errors, e.g. an extra column (pandas 1.3+; older versions used error_bad_lines=False)
airports_df = pd.read_csv('Data/airportsNoHeaderRows.csv',
                          header=None) # the first row does not contain column headers
airports_df = pd.read_csv('Data/airportsNoHeaderRows.csv',
                          header=None,
                          names=['Name','City','Country']) # Provide names for the columns
airports_df = pd.read_csv('Data/airportsBlankValues.csv') # missing values appear as NaN

# write
airports_df.to_csv('Data/MyNewCSVFile.csv',
                   index=False) # exclude the index column (it is written by default)
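
The read and write options above can be tried without any files on disk. The sketch below is an illustration under two assumptions: it uses an in-memory `io.StringIO` buffer with made-up sample rows instead of the course's `Data/` files, and it relies on the `on_bad_lines` parameter, which requires pandas 1.3 or later.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file (sample rows invented for this example)
csv_text = (
    "Name,City,Country\n"
    "Seattle-Tacoma,Seattle,USA\n"
    "Dulles,Washington,USA,unexpected\n"  # extra field: treated as a bad line
    "Schiphol,Amsterdam,\n"               # blank Country: read as NaN
)

# Skip the malformed row instead of raising an error
airports_df = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')
print(airports_df)

# With no path given, to_csv returns the CSV text as a string
with_index = airports_df.to_csv()               # index written as first column
without_index = airports_df.to_csv(index=False) # index column left out

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Comparing the two header lines shows why `index=False` is needed when the index should not appear in the output file.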

### 3.4 Data manipulation
#### 3.4.1 Remove a column
- [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) deletes specified columns from a DataFrame. Use the **inplace** parameter to specify you want to drop the column from the original DataFrame

In [None]:
# remove a column and create a new DataFrame
new_df = df.drop(columns=['COL_NAME'])
# equivalent form using axis: new_df = df.drop(['COL_NAME'], axis=1)

# drop the column from the original DataFrame
df.drop(columns=['COL_NAME'], inplace=True)
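
To make the difference between the two forms concrete, here is a minimal sketch on a small DataFrame invented for this example. It shows that `drop` leaves the original untouched unless `inplace=True` is passed.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Dulles', 'Schiphol'],
    'City': ['Washington', 'Amsterdam'],
    'Country': ['USA', 'Netherlands'],
})

# Without inplace, drop returns a new DataFrame and df keeps all columns
new_df = df.drop(columns=['Country'])
print(list(new_df.columns))
print(list(df.columns))

# With inplace=True, df itself is modified and the call returns None
result = df.drop(columns=['City'], inplace=True)
print(list(df.columns))
```

Because the in-place call returns `None`, assigning its result back to `df` would lose the DataFrame.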

#### 3.4.2 Handling duplicates and rows with missing values

- [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) removes rows with missing values
- [duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) returns a boolean Series in which True marks a row that is a duplicate of a previous row
- [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) returns a DataFrame with duplicate rows removed

**Missing values**: `NaN`

In [None]:
# check missing values
df.info()

# create a new DataFrame with no missing values
no_nulls_df = df.dropna()

# Specify inplace=True to update the existing DataFrame
df.dropna(inplace=True)

# Check the number of rows and number of rows with non-null values to confirm
no_nulls_df.info()                 
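
The effect of `dropna` is easiest to see on a tiny DataFrame. The rows below are invented for this sketch; `None` is used to stand in for missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Seattle-Tacoma', 'Dulles', 'Schiphol'],
    'City': ['Seattle', None, 'Amsterdam'],   # one missing City
    'Country': ['USA', 'USA', None],          # one missing Country
})

# dropna removes every row that has at least one missing value
no_nulls_df = df.dropna()
rows_before = len(df)
rows_after = len(no_nulls_df)
print(rows_before, rows_after)  # 3 1

# The same operation applied to df itself
df.dropna(inplace=True)
print(len(df))
```

Only the first row survives, because each of the other two rows has a missing value in a different column.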

**Duplicate rows**

In [None]:
# check duplicate rows
df.duplicated() # If a row is a duplicate of a previous row, it returns True

# drop duplicate rows
df.drop_duplicates(inplace=True)
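
As a small illustration with invented rows, the sketch below shows what `duplicated` reports and what remains after `drop_duplicates`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Dulles', 'Schiphol', 'Dulles'],
    'City': ['Washington', 'Amsterdam', 'Washington'],
})

# True only for a row that repeats an earlier row in every column
flags = df.duplicated()
print(flags.tolist())  # [False, False, True]

# Keep the first occurrence of each row and drop the repeats
df.drop_duplicates(inplace=True)
print(len(df))
```

The first occurrence is kept, so the remaining rows retain their original index labels 0 and 1.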

## 4. Modeling