<a href="https://colab.research.google.com/github/zwelshman/healthcare-data-analysis-in-python/blob/main/beginner/data_analysis/Introduction_to_Data_analysis_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Introduction to Data Analysis in Python

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is particularly suited to tabular data (similar to SQL tables or Excel spreadsheets), which it represents with two core data structures: the one-dimensional Series and the two-dimensional DataFrame. The library is built on top of NumPy and is a staple of data science and analytics workflows.
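
To make the two data structures concrete, here is a minimal sketch. The column names and values are made up for illustration and are not part of the dataset generated later.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([34, 58, 41], name="Age")

# A DataFrame is a two-dimensional table; each column is a Series.
patients = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Age": ages,
    "Diagnosis": ["Asthma", "Diabetes", "Asthma"],
})

print(type(patients["Age"]))  # selecting a single column returns a Series
print(patients.shape)         # (number of rows, number of columns)
print(patients.dtypes)        # each column carries its own dtype
```

Selecting one column with `patients["Age"]` gives back a Series, which is why most column-level operations in the rest of this notebook are Series operations.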


To have something to work with, we'll generate a synthetic dataset of medical records, using pandas for data manipulation and the `Faker` library to simulate patient data. The dataset contains 10,000 records with the following fields: Patient ID, Name, Age, Gender, Diagnosis, and Last Visit Date. This example assumes both pandas and Faker are installed in your Python environment. If they are not, uncomment the `pip` command below and run it:

In [None]:
# !pip install pandas faker

In [None]:
import pandas as pd
from faker import Faker
import random

# Initialize Faker
fake = Faker()

# Define possible diagnoses
diagnoses = ['Diabetes', 'Hypertension', 'Asthma', 'Heart Disease', 'None']

# Generate data
n_records = 10_000
data = {
    'PatientID': list(range(1, n_records + 1)),
    'Name': [fake.name() for _ in range(n_records)],
    'Age': [random.randint(0, 100) for _ in range(n_records)],
    'Gender': [random.choice(['Male', 'Female']) for _ in range(n_records)],
    'Diagnosis': [random.choice(diagnoses) for _ in range(n_records)],
    'Last Visit Date': [fake.date_between(start_date='-5y', end_date='today') for _ in range(n_records)]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the first few rows of the dataframe
print(df.head())

# Optionally, save the dataframe to a CSV file
df.to_csv('medical_records.csv', index=False)

   PatientID               Name  Age  Gender     Diagnosis Last Visit Date
0          1      Julie Rowland    2  Female  Hypertension      2019-08-09
1          2   Jacqueline Mcgee   17  Female  Hypertension      2023-05-02
2          3    Monique Spencer   27    Male      Diabetes      2022-08-11
3          4     Tony Austin II   75  Female        Asthma      2019-09-16
4          5  Samantha Gonzales  100  Female        Asthma      2021-06-06


## Importing and exporting data with Pandas
Pandas supports various file formats for importing and exporting data, such as CSV, Excel, JSON, HTML, and SQL databases. The most common functions used for this purpose are `read_csv()` for reading CSV files and `to_csv()` for writing to CSV files.

In [None]:
import pandas as pd

# Importing data from a CSV file
df = pd.read_csv('medical_records.csv')

Let's take a look at the top 5 rows using `head()` on the DataFrame:

In [None]:
df.head()

   PatientID               Name  Age  Gender     Diagnosis Last Visit Date
0          1      Julie Rowland    2  Female  Hypertension      2019-08-09
1          2   Jacqueline Mcgee   17  Female  Hypertension      2023-05-02
2          3    Monique Spencer   27    Male      Diabetes      2022-08-11
3          4     Tony Austin II   75  Female        Asthma      2019-09-16
4          5  Samantha Gonzales  100  Female        Asthma      2021-06-06


In [None]:
# Exporting data to a CSV file
df.to_csv('exported_records.csv', index=False)
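
One detail worth knowing about the CSV round trip: CSV stores everything as text, so a date column comes back as strings unless you ask pandas to parse it. The sketch below uses a small made-up DataFrame and an in-memory buffer (rather than a file on disk) to show the difference.

```python
import io
import pandas as pd

# Small illustrative sample; not the generated dataset.
sample = pd.DataFrame({
    "PatientID": [1, 2],
    "Last Visit Date": pd.to_datetime(["2023-05-02", "2021-06-06"]),
})

# Write to an in-memory text buffer instead of a file.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Plain read: the date column is loaded as strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates, the column is converted to datetimes on load.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Last Visit Date"])

print(plain.dtypes)
print(parsed.dtypes)
```

The same `parse_dates` argument should work when reading `medical_records.csv`, which makes date arithmetic and date-based filtering possible later on.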

## Data cleaning techniques: handling missing values, filtering, and data transformation
Data cleaning is an essential step in preparing data for analysis. Pandas provides several methods for handling missing values, such as `dropna()` to remove rows or columns with missing data and `fillna()` to replace missing values with a specified value.

In [None]:
# Handling missing values by dropping them
df_cleaned = df.dropna()

# Filling missing values with a specified value
df_filled = df.fillna(value=0)
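
Dropping every incomplete row or filling everything with `0` is rarely what you want on real records. As a sketch of a more targeted approach, the example below uses a small made-up DataFrame with deliberately missing values; the choice of the median for Age and `"Unknown"` for Diagnosis is an assumption for illustration, not a recommendation for clinical data.

```python
import numpy as np
import pandas as pd

# Illustrative records with gaps in two columns.
records = pd.DataFrame({
    "PatientID": [1, 2, 3, 4],
    "Age": [34.0, np.nan, 58.0, 41.0],
    "Diagnosis": ["Asthma", "Diabetes", None, "Asthma"],
})

# Count missing values per column before deciding how to handle them.
print(records.isna().sum())

# Drop only the rows where Diagnosis is missing.
has_diagnosis = records.dropna(subset=["Diagnosis"])

# Fill each column with a value that suits that column.
filled = records.fillna({
    "Age": records["Age"].median(),
    "Diagnosis": "Unknown",
})
print(filled)
```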

Filtering data is done using boolean indexing, and data transformation can involve operations like applying functions to columns or rows, mapping values, or replacing values.

In [None]:
# Filtering data
df_filtered = df[df['Diagnosis'] == 'Diabetes']

# Data transformation: add one year to every age (vectorized, no apply() needed)
df['Age'] = df['Age'] + 1
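
The paragraph above also mentions combining filters, mapping values, and replacing values. A minimal sketch of each, on a small made-up DataFrame (the `GenderCode` column and the `"No diagnosis"` label are hypothetical names chosen for this example):

```python
import pandas as pd

records = pd.DataFrame({
    "Age": [12, 45, 70, 33],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Diagnosis": ["Asthma", "Diabetes", "Hypertension", "None"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
older_women = records[(records["Age"] >= 40) & (records["Gender"] == "Female")]

# isin() matches a column against several values at once.
chronic = records[records["Diagnosis"].isin(["Diabetes", "Hypertension"])]

# map() translates every value through a dictionary.
records["GenderCode"] = records["Gender"].map({"Male": "M", "Female": "F"})

# replace() swaps specific values and leaves the rest untouched.
records["Diagnosis"] = records["Diagnosis"].replace("None", "No diagnosis")

print(older_women)
print(chronic)
print(records)
```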

## Data manipulation: indexing, merging, grouping, and reshaping data
Indexing in Pandas selects specific rows and columns, either by label with `loc` or by position with `iloc`. Merging and joining DataFrames is akin to SQL joins and can be done with the `merge()` function or the `join()` method.

In [None]:
# Indexing
df_subset = df.loc[df['Age'] > 30, ['Name', 'Diagnosis']]

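# Merging two DataFrames
# df1 and df2 are illustrative splits of df that share the PatientID key
df1 = df[['PatientID', 'Name', 'Age']]
df2 = df[['PatientID', 'Diagnosis', 'Last Visit Date']]
df_merged = pd.merge(df1, df2, on='PatientID')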
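
Since merging is compared to SQL joins, it helps to see how the join type changes the result when the two tables do not contain exactly the same keys. The sketch below uses two small made-up DataFrames (`patients` and `visits` are hypothetical names); `how` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

patients = pd.DataFrame({"PatientID": [1, 2, 3], "Name": ["A", "B", "C"]})
visits = pd.DataFrame({"PatientID": [2, 3, 4], "Visits": [5, 1, 2]})

# Inner join (the default): only PatientIDs present in both tables.
inner = pd.merge(patients, visits, on="PatientID")

# Left join: every patient is kept; missing visit counts become NaN.
left = pd.merge(patients, visits, on="PatientID", how="left")

# Outer join: all keys from both sides; indicator=True adds a _merge
# column saying where each row came from.
outer = pd.merge(patients, visits, on="PatientID", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

The `_merge` column is a quick way to spot records that exist in only one of the two sources before deciding which join type is appropriate.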

Grouping data is performed with the `groupby()` function, which is useful for aggregate computations. Reshaping data can involve pivoting with `pivot()` or `pivot_table()` and melting with `melt()`.

In [None]:
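# Grouping data: mean age per diagnosis
# (select the numeric column first; newer pandas versions raise an error
# when mean() is applied to non-numeric columns such as Name)
df_grouped = df.groupby('Diagnosis')['Age'].mean()

# Reshaping data with pivot_table: mean age by diagnosis and gender
# (the dataset has no 'TestResult' column, so Age is used as the value)
df_pivoted = df.pivot_table(index='Diagnosis', columns='Gender', values='Age', aggfunc='mean')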
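
To show grouping, pivoting, and melting working together, here is a small self-contained sketch on made-up data. The column name `MeanAge` is a hypothetical label introduced for this example.

```python
import pandas as pd

records = pd.DataFrame({
    "Diagnosis": ["Asthma", "Asthma", "Diabetes", "Diabetes"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [10, 30, 50, 70],
})

# Several aggregates at once for a single numeric column.
summary = records.groupby("Diagnosis")["Age"].agg(["count", "mean", "max"])

# Long to wide: one row per diagnosis, one column per gender.
wide = records.pivot_table(
    index="Diagnosis", columns="Gender", values="Age", aggfunc="mean"
)

# Wide back to long: melt() turns the gender columns into rows again.
long = wide.reset_index().melt(
    id_vars="Diagnosis", var_name="Gender", value_name="MeanAge"
)

print(summary)
print(wide)
print(long)
```

`pivot_table()` is used here rather than `pivot()` because it aggregates when several rows share the same index and column combination, whereas `pivot()` raises an error on such duplicates.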

By mastering these Pandas functionalities, you can effectively manage and prepare patient records for analysis, focusing on the diagnosis or any other aspect of the dataset.