## **DATA SCIENCE WITH PYTHON TAPAS**

*   First, download the new desktop version of JupyterLab at https://github.com/jupyterlab/jupyterlab-desktop
*   A variety of medical datasets can be found at https://informaticseducation.org/file-share
*   In this tutorial we will jump right into analyzing data using pandas plus newer AutoML packages, with an initial emphasis on exploratory data analysis (EDA)
*   With some of the newer AutoML Python packages you can analyze data with only a few lines of code
*   We will also include PyCaret, which is a complete and automated machine learning pipeline





## **Menus**


1. Menu at top of notebook

* Click on the down arrow next to Code or Markdown above. This will create a new cell below for coding or writing text

* Markdown is a lightweight markup language used to format and annotate text. Tips on how to write Markdown can be found here: https://www.earthdatascience.org/courses/intro-to-earth-data-science/file-formats/use-text-files/format-text-with-markdown-jupyter-notebook/

* Save icon above

* Plus icon to insert a new cell

* Scissors to cut a cell

* Copy

* Paste

* Execute (play) icon

* Stop button

* The double play icon restarts the kernel (Python process) and runs all cells

* Double click to open a text box and use execute icon to close it

2. General: if you just want to leave a comment in a code cell as a reminder, start it with a # and Python will treat the rest of the line as a comment, not code. For example: # the code below is a shortcut I learned from KDNuggets
3. Menu top left:

* File: save or open notebook or create new one. Download code or notebook (ipynb) or print

* Edit: select all, cut, copy or paste. Find and replace. Voice recognition is also available

* Run: run all cells. When you return to your notebook it is likely that you will have to upload your data file again and run all cells

* Kernel is another place to restart, interrupt, or shutdown the kernel

* Settings is where you can select auto save, set themes, etc.

4. Menu on far left

* Folder icon: click the file icon with the up arrow to upload your file. It will show up on the left. You can also create new notebooks and folders. There is a search window for files

* Circle with square shows open tabs, kernels and terminals

* Outline icon shows the section headers. Like a table of contents

* Extensions are sometimes difficult to install so we won't use any



# **Upload a file**

1.   Download the Heart Prediction file from https://www.informaticseducation.org/file-share/4073445b-edf2-4914-900d-d41bb9999db6
2.   Upload it using the upload icon on the left. You should see your file there
3. Note: most packages you will need will already be installed in Jupyter Lab, but they still need to be imported each time you start a new notebook
>import pandas as pd # you have given it the alias pd
4. Now read in the csv file to create a dataframe (df)
>df = pd.read_csv('your file')
The file path has to be in quotes (single or double quotes both work). The easiest way is to right click the heart disease prediction file, select "copy path" and paste it between the quotation marks. See code below. Remember to execute the code.
> You can also read in other file types such as SQL, HTML or Excel; e.g. df = pd.read_excel('myexcelfile'), etc.
5. Once you have modified your dataframe you can write it back to a csv file with df.to_csv('myDataFrame.csv')
6. Note: you can call your dataframe anything you want; df just reminds you it is a dataframe
7. Unlike Google Colab, uploaded files persist. When you return to the notebook you will still need to run the cells again, and upload the file again if it is no longer listed on the left
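
The read and write steps above can be sketched as a small round trip. This is only an illustration under stated assumptions: the three rows are copied from the first rows of the heart dataset, and the file is written to a temporary folder so that nothing in your working directory is touched. The `index=False` argument is a choice made here, not something the steps above require.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data: three columns from the first three rows of the heart dataset
df = pd.DataFrame(
    {
        "Age": [70, 67, 57],
        "BP": [130, 115, 124],
        "Heart Disease": ["Presence", "Absence", "Presence"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "myDataFrame.csv"

    # Write the dataframe to a csv file; index=False leaves the row index out
    df.to_csv(csv_path, index=False)

    # Read the file back into a new dataframe
    df_again = pd.read_csv(csv_path)

print(df_again)
```

Without `index=False`, pandas also writes the row index, which comes back as an extra unnamed column when the file is read again.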



In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('Heart_Disease_Prediction_Classification.csv')

# **Pandas**

1.  Pandas allows you to convert your dataset into a dataframe (a 2-dimensional table) on which you can perform basic tasks such as exploratory data analysis (EDA). The newer AutoML packages require pandas as well
2. We will review several standard, easy pandas functions that support EDA
3. Pandas-profiling is a Python package that expedites some EDA steps. PyCaret uses pandas-profiling to automate the early steps. We won't review it, as Sweetviz and D-Tale are probably superior
4. To be sure your data were uploaded try df.head(5). This will show you the first 5 rows of data. df.tail(5) will display the last 5 rows.

> df.shape will display the number of rows and columns

> df.info() will display column names, data types and non-null counts (which reveal missing data)

> df.describe() will give count, mean, std, min, 25%, 50%, 75%, and max






5. For more information about pandas and to get their cheat sheet, visit [DataCamp](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet)



In [3]:
df.head(5)

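| | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 | 3 | Presence |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 | 7 | Absence |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 | 7 | Presence |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 | 7 | Absence |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 | 3 | Absence |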


In [4]:
df.shape

(270, 14)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      270 non-null    int64  
 1   Sex                      270 non-null    int64  
 2   Chest pain type          270 non-null    int64  
 3   BP                       270 non-null    int64  
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(1), int64(12), 

In [6]:
df.describe()

| | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 |
| mean | 54.433333 | 0.677778 | 3.174074 | 131.344444 | 249.659259 | 0.148148 | 1.022222 | 149.677778 | 0.32963 | 1.05 | 1.585185 | 0.67037 | 4.696296 |
| std | 9.109067 | 0.468195 | 0.95009 | 17.861608 | 51.686237 | 0.355906 | 0.997891 | 23.165717 | 0.470952 | 1.14521 | 0.61439 | 0.943896 | 1.940659 |
| min | 29.0 | 0.0 | 1.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 25% | 48.0 | 0.0 | 3.0 | 120.0 | 213.0 | 0.0 | 0.0 | 133.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 50% | 55.0 | 1.0 | 3.0 | 130.0 | 245.0 | 0.0 | 2.0 | 153.5 | 0.0 | 0.8 | 2.0 | 0.0 | 3.0 |
| 75% | 61.0 | 1.0 | 4.0 | 140.0 | 280.0 | 0.0 | 2.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 7.0 |
| max | 77.0 | 1.0 | 4.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 3.0 | 3.0 | 7.0 |
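
By default df.describe() summarizes only numeric columns, which is why the Heart Disease column (dtype object in df.info()) does not appear in the summary. A minimal sketch of how a text column could be summarized instead is shown below. It uses only the five rows printed by df.head(5) as stand-in data; in the notebook you would run the same calls on the full df.

```python
import pandas as pd

# Five-row excerpt copied from the df.head(5) output (not the full dataset)
sample = pd.DataFrame(
    {
        "Age": [70, 67, 57, 64, 74],
        "Heart Disease": ["Presence", "Absence", "Presence", "Absence", "Absence"],
    }
)

# Count how many rows fall into each class of the target column
class_counts = sample["Heart Disease"].value_counts()
print(class_counts)

# describe() on a text column reports count, unique, top and freq
target_summary = sample["Heart Disease"].describe()
print(target_summary)
```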


## **Install and Import**

1.   Many of the common Python packages have already been installed; most just need to be imported. However, some of the newer packages we plan to use must be installed first, usually with the pip install command. For Sweetviz we needed to run it from a notebook cell as !pip install (note the exclamation point)
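
One way to tell whether a package needs installing before it can be imported is to ask Python whether it can find it. The helper below is a hypothetical convenience written for this tutorial (it is not part of pandas, Sweetviz or D-Tale) and relies only on the standard library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


for name in ["pandas", "sweetviz", "dtale"]:
    if is_installed(name):
        print(f"{name}: already installed, just import it")
    else:
        print(f"{name}: not found, run  !pip install {name}  in a notebook cell")
```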




# **Sweetviz**

1.   This package is a great way to begin analyzing your data with exploratory data analysis (EDA)
2. [Sweetviz package details](https://pypi.org/project/sweetviz/)
3. [Article](https://towardsdatascience.com/fast-eda-in-jupyter-colab-notebooks-using-sweetviz-2-0-99c22bcb3a1c) on Sweetviz
4.   Sweetviz does the following:


> Analyzes data with descriptive stats and visualizations with 2 lines of code!


> Compares two datasets, such as train vs test datasets


> Performs correlations (associations)

> A recent enhancement lets the report be displayed inside the notebook instead of as a separate html page
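
The train vs test comparison mentioned above could look roughly like the sketch below. The split uses plain pandas on a five-row excerpt of the heart dataset, and the 40% hold-out fraction and the helper name compare_train_test are assumptions made for illustration. The Sweetviz call is wrapped in a function so that the split itself runs even where Sweetviz is not installed.

```python
import pandas as pd


def compare_train_test(train: pd.DataFrame, test: pd.DataFrame):
    """Build a Sweetviz report comparing two dataframes (requires sweetviz)."""
    import sweetviz as sv  # imported here so the split below runs without it

    return sv.compare([train, "Train"], [test, "Test"])


# Five-row excerpt from the heart dataset, used only to illustrate the split
sample = pd.DataFrame(
    {
        "Age": [70, 67, 57, 64, 74],
        "Max HR": [109, 160, 141, 105, 121],
    }
)

# Hold out 40% of the rows as a test set; random_state makes the split repeatable
test = sample.sample(frac=0.4, random_state=0)
train = sample.drop(test.index)

print(len(train), len(test))

# In the notebook, with sweetviz installed:
# report = compare_train_test(train, test)
# report.show_notebook()
```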











In [7]:
!pip install sweetviz




Defaulting to user installation because normal site-packages is not writeable


Now that you have installed sweetviz you need to import it.
Be sure to look at the following:

> Carefully look at each column variable. Look at the basic descriptive statistics, plus the numerical and categorical associations. Starting with age, note the negative association between age and Max HR. Also note that you can change the number of bins for the age groups to 5, 15 or 30

> Sweetviz lists the most frequent, smallest and largest values

> When you are done, select the Associations button at the top for a correlation matrix. Squares represent categorical data and circles represent numerical data

> What did you learn?
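
If you want to check an association such as age vs Max HR outside the Sweetviz report, pandas can compute it directly. The sketch below is only an illustration: it uses the five rows shown by df.head(5), so the value it prints is not the correlation for the full 270-row dataset. In the notebook you would replace sample with df. The pd.cut call is a rough pandas analogue of changing the number of histogram bins.

```python
import pandas as pd

# Five-row excerpt copied from the df.head(5) output (not the full dataset)
sample = pd.DataFrame(
    {
        "Age": [70, 67, 57, 64, 74],
        "Max HR": [109, 160, 141, 105, 121],
    }
)

# Pearson correlation between the two numeric columns
r = sample["Age"].corr(sample["Max HR"])
print(round(r, 2))

# Group ages into 5 equal-width bins and count the rows in each bin
age_groups = pd.cut(sample["Age"], bins=5)
print(age_groups.value_counts().sort_index())
```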









In [8]:
import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_notebook() 


# **D-Tale**

1. This is a relatively new package for interactive data exploration that also includes visualization. It only requires 2 lines of code.

2. Package information (https://github.com/man-group/dtale and https://pypi.org/project/dtale/)
3. Pertinent article (https://towardsdatascience.com/introduction-to-d-tale-5eddd81abe3f) (March 2020)
4. D-Tale will generate a new html page **outside** this notebook
5. Select the arrow at the top left to access the D-Tale menu

* Describe offers options similar to Sweetviz: descriptive statistics, histograms and box plots for continuous data, counts and Q-Q plots. It also identifies outliers

* Filter

* Dataframe functions to create or modify columns

* Clean column removes unwanted content from a column, e.g. punctuation

* Merge and stack - append or concatenate vertically

* Summarize data - create a new grid by using a pivot table, group by or transpose

* Analyze duplicates - remove

* Analyze missing data - locate

* Feature analysis (remove multicollinearity)

* Correlations (heatmap)

* Predictive Power Score (similar to correlations)

* Charts (12 chart types). Images can be exported as html or png. File can be exported as csv for those data points. Can export code for chart and hyperlink to chart.

* Highlight - data types, outliers, missing and range

* Export data or code








6. Select any column to see the options: similar to above, such as look for duplicates, filter, heatmap, etc.





In [14]:
!pip install dtale

Defaulting to user installation because normal site-packages is not writeable
Collecting dtale
  Downloading dtale-1.60.2-py2.py3-none-any.whl (11.4 MB)
     |████████████████████████████████| 11.4 MB 2.1 MB/s            
Collecting dash-daq
  Downloading dash_daq-0.5.0.tar.gz (642 kB)
     |████████████████████████████████| 642 kB 1.6 MB/s            
  Preparing metadata (setup.py) ... done
Collecting statsmodels
  Using cached statsmodels-0.13.0-cp38-cp38-macosx_10_15_x86_64.whl (9.5 MB)
Collecting lz4
  Downloading lz4-3.1.3-cp38-cp38-macosx_10_9_x86_64.whl (261 kB)
     |████████████████████████████████| 261 kB 1.1 MB/s            
Collecting missingno<=0.4.2
  Downloading missingno-0.4.2-py3-none-any.whl (9.7 kB)
Collecting scikit-learn==0.24.2
  Downloading scikit_learn-0.24.2-cp38-cp38-macosx_10_13_x86_64.whl (7.2 MB)
     |████████████████████████████████| 7.2 MB 1.2 MB/s            
Collecting ppscore
  Downloading ppscore-1.2.0.tar.gz (47 kB)
     |██

In [24]:
pip install --upgrade dtale

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [27]:
import dtale

In [28]:
d = dtale.show(df)
d.open_browser()

### D-Tale details

Click on the arrow icon in the upper left to open a large menu with the following categories:

1. Describe will provide descriptive stats plus a histogram and box plot for continuous data
2. Custom filter
3. Show, hide, build and clean columns
4. Merge and stack data vertically
5. Find duplicates
6. Find missing data
7. Feature analysis
8. Correlation matrix and chart
9. Predictive Power Score is another way to demonstrate relationships between columns
10. Charts - 12 chart types
11. Heatmap
12. Highlight range, outliers, missing and data types
13. Low variance flag
14. Export code and csv files
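
Several of these menu items have simple pandas counterparts, which can help you understand what D-Tale reports. The sketch below is an approximation, not D-Tale's own implementation, and the toy dataframe is hypothetical data built with one duplicate row and one missing value so that each check has something to find.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are identical, and one BP value is missing
toy = pd.DataFrame(
    {
        "Age": [70, 67, 67, 64],
        "BP": [130, 115, 115, None],
        "Heart Disease": ["Presence", "Absence", "Absence", "Absence"],
    }
)

# Find duplicates: number of rows that repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

# Find missing data: number of missing values in each column
missing_per_column = toy.isna().sum()

# Correlation matrix: computed on the numeric columns only
correlations = toy.select_dtypes("number").corr()

print(duplicate_rows)
print(missing_per_column)
print(correlations)
```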


# **HONORABLE MENTION**

1. Pandas profiling
2. Lux
3. Mito
