# Introduction to Jupyter

### Keyboard shortcuts

- Press the Esc key to enter command mode; press Enter to return to edit mode
- Executing cells - Esc, Shift-Enter OR Ctrl-Enter
- Adding a cell above - Esc, A
- Adding a cell below - Esc, B
- Delete a cell - Esc, D, D
- Copy, Cut, Paste - Ctrl-C, Ctrl-X, Ctrl-V
- Undo - Ctrl-Z
- Select all - Ctrl-A
- Comment Cell - Ctrl-/

# Data Analysis with Python

### Why use Python for Data Analysis?

#### Python vs R
- Infographic:  https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis
- You can do most things in both languages
- R was designed with statistical analysis in mind / Python is a more general-purpose language
- Python is easier to learn and read
- R is great for building standalone pipelines / Python is more flexible (Scripting / Web Applications)
- You can run R code from Python and vice versa! (rPython in R, RPy2 in Python)
- R has built-in data structures for large datasets (or use the data.table package) / Python uses pandas or NumPy
- Visualizations are easier in R, but not easily integrated into Web Apps
- Both have great community support

#### NumPy
- Used for matrices with only one data type (Ex. gene expression data)
- http://www.numpy.org/
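
To make the "one data type" point concrete, here is a minimal sketch. The expression values are invented for illustration and are not part of the course data.

```python
import numpy as np

# Hypothetical expression values: rows are genes, columns are samples
expression = np.array([
    [2.0, 4.0, 6.0],
    [1.0, 3.0, 5.0],
])

print(expression.dtype)         # every element shares one data type
print(expression.shape)         # (rows, columns)
print(expression.mean(axis=1))  # mean per gene (row)

# Mixing numbers and text forces NumPy to pick a single common type
mixed = np.array([1, "a"])
print(mixed.dtype)
```

In the second array the number is stored as text, which is why mixed-type tables are usually a better fit for pandas.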

#### Pandas Dataframes
- Used like a database, can store multiple different datatypes with column headers and row names
- http://pandas.pydata.org/
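
A small sketch of a dataframe holding several data types with column headers and row names. The records and the row labels (`p1`, `p2`, `p3`) are made up for illustration.

```python
import pandas as pd

# Hypothetical records; each column can hold a different data type
patients = pd.DataFrame(
    {
        "diagnosis": ["M", "B", "B"],
        "radius_mean": [17.99, 13.54, 12.45],
        "follow_up": [True, False, True],
    },
    index=["p1", "p2", "p3"],  # row names
)

print(patients.dtypes)                    # one dtype per column
print(patients.loc["p2", "radius_mean"])  # look up by row name and column header
```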

#### Plotting and Visualization - Matplotlib, plotly and Seaborn
- Many good visualization libraries

#### Machine Learning - scikit-learn
- Take our Machine Learning with Python Course!

# Steps to Analyze Your Data

1. Loading your data (csv, databases, text file, JSON/XML)
2. Data Exploration
3. Cleaning data
4. Normalizing Data
5. Analysis
6. Plotting and Visualization
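
The steps above can be sketched end to end on a tiny made-up table. This is only an illustration under assumptions: the values are invented, dropping rows is just one way to clean, and min-max scaling is just one way to normalize. Plotting (step 6) is left out here.

```python
import io
import pandas as pd

# 1. Load: a small in-memory CSV stands in for a real file (made-up values)
raw = io.StringIO(
    "diagnosis,radius_mean,texture_mean\n"
    "M,20.0,10.0\n"
    "B,10.0,\n"
    "B,15.0,30.0\n"
)
toy = pd.read_csv(raw)

# 2. Explore
print(toy.shape)
print(toy.dtypes)

# 3. Clean: drop rows with missing values (one option among several)
toy = toy.dropna()

# 4. Normalize: min-max scale the numeric columns to the range 0-1
numeric = toy.select_dtypes("number")
toy[numeric.columns] = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# 5. Analyze: average of each feature per diagnosis
print(toy.groupby("diagnosis").mean())
```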

# Managing Packages

### PyPI
- PyPI (the Python Package Index) is the index of publicly available packages
- https://pypi.org/

### Pip
- Check if pip is installed
    1. Open a command prompt (Windows) or terminal (Mac)
    2. Type 'python -V'
    3. Type 'pip -V'
- pip is the tool used for managing packages in Python
- It is installed with Python 3
- Use it to install packages that are not part of the Python standard library
- https://pypi.org/project/pip/

### Packages we will use in this course
- numpy, pandas, matplotlib, seaborn
- Check what packages are installed with help("modules")
- Install packages from the command line
    1. Open a command prompt (Windows) or Terminal (Mac)
    2. Type 'pip install <package-name>' (ex. pip install pandas, pip install numpy)

### Importing packages
- Use the import statement to make a package available in your code

In [None]:
# What packages are already installed?
help("modules")

In [None]:
import numpy
import pandas as pd

# Loading your Data

### From a Database

#### Tested packages for connecting to databases
1. MySQL (mysql.connector)
2. SQL Server / Sybase (pymssql)
3. DB2 (ibm_db)
4. PostgreSQL (psycopg2)
5. MongoDB / Other NoSQL databases (pymongo)
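
The relational drivers listed above generally share a connect / cursor / execute / fetch pattern. The sketch below shows that pattern with `sqlite3` from the standard library, swapped in here so it runs without a database server. The table name and values are made up, and connection arguments and placeholder style differ per driver (some use `%s` instead of `?`). MongoDB uses a different, document-oriented API.

```python
import sqlite3

# In-memory SQLite database: a stand-in so the example needs no server
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and values, for illustration only
cur.execute("CREATE TABLE samples (id INTEGER, diagnosis TEXT, radius_mean REAL)")
cur.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, "M", 17.99), (2, "B", 13.54)],
)
conn.commit()

# Query with a parameter placeholder rather than string formatting
cur.execute("SELECT id, radius_mean FROM samples WHERE diagnosis = ?", ("B",))
rows = cur.fetchall()
print(rows)

conn.close()
```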


### From a file into Pandas
- Read from a csv file: read_csv()
- Read from an excel file: read_excel()
- Read from a json file: read_json()
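
A minimal sketch of two of these readers, using in-memory text in place of real files so it runs anywhere. The data is made up. `read_excel` is left out because it typically needs an extra engine package (such as openpyxl for .xlsx files).

```python
import io
import pandas as pd

csv_text = "id,diagnosis,radius_mean\n1,M,17.99\n2,B,13.54\n"
json_text = '[{"id": 1, "diagnosis": "M"}, {"id": 2, "diagnosis": "B"}]'

# read_csv accepts a path or any file-like object; index_col picks the row labels
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="id")

# read_json parses a list of records into one row per record
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```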

In [None]:
# Read data from a csv file
df = pd.read_csv('./data/breast_cancer.csv')

# Data Exploration with Pandas Dataframes

- Get to know your data
- number of rows, columns
- what does the data look like?
- summary of data
- what types are in our dataset?

In [None]:
# What is the shape of your data?
df.shape

In [None]:
# What do the first five rows look like?
df.head()

In [None]:
# What do the first 20 rows look like?
# Type your answer here:

In [None]:
# TODO: What do the last five rows look like?
# Type your answer here:


In [None]:
# Get summary statistics on your dataset
df.describe()

In [None]:
# What data types are in our dataset?
df.dtypes

## Cleaning Data

In [None]:
# Let's look at the first 10 rows of data
# Type code here:


In [None]:
# Identify missing values
# isna() is an alias of isnull(), so both calls print the same True/False table
print(df.isnull())
print(df.isna())

In [None]:
# Are there any missing values?
print(df.isnull().values.any())

In [None]:
# Which columns contain missing values?
print(df.columns[df.isnull().any()].tolist())

In [None]:
# Replace null values with 0
df_clean = df.fillna(0.0)
df_clean.head()

In [None]:
# Drop a single column, create a new dataframe
df_clean = df_clean.drop(columns=['bad_data'])
df_clean.head()
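
Filling gaps with 0 is simple, but it can distort statistics such as the mean. The sketch below shows two common alternatives on a made-up table: dropping incomplete rows and filling with the column mean. Which one fits depends on your data.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values, for illustration only
toy = pd.DataFrame({
    "radius_mean": [10.0, np.nan, 14.0],
    "texture_mean": [20.0, 22.0, np.nan],
})

# Count missing values per column
print(toy.isnull().sum())

# Option 1: drop any row that has a missing value
dropped = toy.dropna()

# Option 2: fill each column's gaps with that column's mean
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```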

# Manipulating Data with Pandas Dataframes

In [74]:
# Retrieve the columns
columns = df_clean.columns
print(columns)
print(type(columns))

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
<class 'pandas.core.indexes.base.Index'>


In [75]:
# Convert a pandas index to a Python list()
columns = list(df_clean.columns)
print(columns)
print(type(columns))

['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
<class 'list'>


## Creating a Correlation Matrix

In [76]:
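# Use pandas built-in correlation function
# numeric_only=True skips text columns such as 'diagnosis'
# (required on pandas 2.0+; the argument is available from pandas 1.5)
correlation = df.corr(numeric_only=True)
print(correlation)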

                               id  radius_mean  texture_mean  perimeter_mean  \
id                       1.000000     0.074626      0.099770        0.073159   
radius_mean              0.074626     1.000000      0.323782        0.997855   
texture_mean             0.099770     0.323782      1.000000        0.329533   
perimeter_mean           0.073159     0.997855      0.329533        1.000000   
area_mean                0.096893     0.987357      0.321086        0.986507   
smoothness_mean         -0.012968     0.170581     -0.023389        0.207278   
compactness_mean         0.000096     0.506124      0.236702        0.556936   
concavity_mean           0.050080     0.676764      0.302418        0.716136   
concave points_mean      0.044158     0.822529      0.293464        0.850977   
symmetry_mean           -0.022114     0.147741      0.071401        0.183027   
fractal_dimension_mean  -0.052511    -0.311631     -0.076437       -0.261477   
radius_se                0.143048     0.
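
A full correlation matrix is hard to read at a glance. One way to use it is to pull out a single column and rank the other features by correlation strength. The sketch below does this on a small made-up table; the column names are illustrative, and `numeric_only` assumes pandas 1.5 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "radius": [1.0, 2.0, 3.0, 4.0],
    "perimeter": [2.0, 4.0, 6.0, 8.0],    # perfectly tied to radius
    "smoothness": [4.0, 3.0, 2.0, 1.0],   # moves opposite to radius
    "label": ["a", "b", "a", "b"],        # non-numeric, skipped below
})

corr = toy.corr(numeric_only=True)

# Correlation of every other numeric column with "radius"
with_radius = corr["radius"].drop("radius")

# Rank by absolute value so strong negative correlations are not overlooked
ranked = with_radius.abs().sort_values(ascending=False)
print(with_radius)
print(ranked)
```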

# Plotting and Visualization
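
As a starting point, here is a minimal Matplotlib sketch that overlays one histogram per diagnosis group. The data is made up for illustration; with the course dataset you would pass your own dataframe and column names instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.5, 12.1, 13.4, 11.8],
})

fig, ax = plt.subplots()

# One histogram per diagnosis group, overlaid on the same axes
for name, group in toy.groupby("diagnosis"):
    ax.hist(group["radius_mean"], bins=5, alpha=0.6, label=name)

ax.set_xlabel("radius_mean")
ax.set_ylabel("count")
ax.legend(title="diagnosis")
```

In a notebook the figure is normally displayed below the cell; in a plain script you would add `plt.show()` or `fig.savefig(...)`.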

# Export Your Script

File -> Download As -> Python (.py)

# Next Steps

Take our next course!

TODO: Add resources for learning