# Introduction to Pandas, Data Cleaning, and Transformation

## 1. Introduction to the Pandas Library

### 1.1 Overview
Pandas is a popular Python library for data manipulation and analysis. It provides data structures like Series and DataFrame, which are designed to handle a wide variety of data types and enable powerful data manipulation operations.

### 1.2 Installation
If you haven't installed pandas yet, run the following command in a terminal. Inside a Jupyter notebook cell, prefix the command with `!`.

```bash
pip install pandas
```

### 1.3 Importing Pandas
To use pandas in your code, import it first. The conventional alias is `pd`:

In [None]:
import pandas as pd

## 2. Pandas Data Structures

### 2.1 Series
A Series is a one-dimensional labeled array capable of holding any data type.

Creating a Series

In [None]:
data = [1, 2, 3, 4]
ser = pd.Series(data)
print(ser)
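
The "labeled" part of the definition is easiest to see with an explicit index. The following is a minimal sketch; the labels and the name `scores` are made up for illustration and do not come from the course data.

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration.
scores = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# .loc selects by label, .iloc selects by integer position.
by_label = scores.loc["c"]
by_position = scores.iloc[0]

# Arithmetic is applied element-wise and the labels are preserved.
doubled = scores * 2
print(doubled)
```

When no index is given, as in the cell above, pandas assigns the default integer labels 0, 1, 2, and so on.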

### 2.2 DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or SQL table.

Creating a DataFrame

In [None]:
import numpy as np

data = {
    'column1': [1, 2, 3, 4, np.nan],
    'column2': ['A', 'B', 'C', 'D', 'E']
}
df = pd.DataFrame(data)
print(df)
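
A few inspection calls help confirm what was built. The sketch below rebuilds the same DataFrame so it can run on its own; the variable names `numbers` and `first_row` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 2, 3, 4, np.nan],
    "column2": ["A", "B", "C", "D", "E"],
})

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column1 is stored as float64 because NaN is a float

# Selecting one column returns a Series.
numbers = df["column1"]

# .iloc selects a row by position; the result is also a Series.
first_row = df.iloc[0]
```

Note that the single `np.nan` is enough to turn the otherwise integer `column1` into a floating-point column.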

## 3. Data Cleaning

### 3.1 Handling Missing Data
Pandas provides several methods for dealing with missing data, such as `dropna()` and `fillna()`. By default both return a new DataFrame and leave the original unchanged.

Drop missing data

In [None]:
df.dropna()

Fill missing data with a specified value

In [None]:
df.fillna(value=0)
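
Before dropping or filling, it is usually worth counting the missing values first. The sketch below does that on the same small DataFrame; filling with the column mean is one possible choice shown as an assumption, not a recommendation from the original material.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": [1, 2, 3, 4, np.nan],
    "column2": ["A", "B", "C", "D", "E"],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Both calls return new DataFrames; df itself is not modified.
dropped = df.dropna()
filled = df.fillna(value=0)

# Alternative (an assumption for illustration): fill with the column mean.
mean_filled = df["column1"].fillna(df["column1"].mean())

print(missing_per_column)
```

To keep a result, assign it to a name as above, since the original `df` still contains the missing value afterwards.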

### 3.2 Removing Duplicates
Use the `drop_duplicates()` method to remove duplicate rows from a DataFrame. It returns a new DataFrame in which only the first occurrence of each repeated row is kept.

In [None]:
df.drop_duplicates()
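
The example DataFrame has no repeated rows, so the effect is easier to see on data that does. The sketch below uses a made-up `people` table; the `subset` and `keep` parameters control which columns are compared and which occurrence survives.

```python
import pandas as pd

# Hypothetical data containing one fully duplicated row (rows 0 and 2).
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Ana"],
    "city": ["Lima", "Oslo", "Lima", "Quito"],
})

# duplicated() flags rows that repeat an earlier row.
n_dupes = people.duplicated().sum()

# Compare whole rows and keep the first occurrence (the default).
unique_rows = people.drop_duplicates()

# Compare only the "name" column and keep the last occurrence.
unique_names = people.drop_duplicates(subset=["name"], keep="last")

print(unique_rows)
print(unique_names)
```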

## 4. Data Transformation

### 4.1 Renaming Columns
You can rename columns in a DataFrame using the rename() method.

In [None]:
# 'old_name' and 'new_name' are placeholders; labels not present in df are ignored by default
df.rename(columns={'old_name': 'new_name'}, inplace=True)
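
Because `rename()` ignores labels it cannot find, a typo in a column name goes unnoticed. The sketch below shows that behavior and the `errors="raise"` option that surfaces it; the new names `value` and `label` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2], "column2": ["A", "B"]})

# Without inplace=True, rename() returns a new DataFrame.
renamed = df.rename(columns={"column1": "value", "column2": "label"})

# A key that matches no column is silently ignored by default.
unchanged = df.rename(columns={"no_such_column": "x"})

# errors="raise" turns the silent miss into a KeyError.
try:
    df.rename(columns={"no_such_column": "x"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```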

### 4.2 Replacing Values
To replace values in a DataFrame, you can use the replace() method.

In [None]:
# 'old_value' and 'new_value' are placeholders; only cells equal to 'old_value' are replaced
df.replace({'old_value': 'new_value'}, inplace=True)
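
A flat dictionary applies to every column, which is not always what is wanted. The sketch below contrasts that with a nested dictionary that limits the replacement to one column; the column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical data: "N" means "No" in one column but is a grade in the other.
df = pd.DataFrame({"answer": ["Y", "N", "Y"], "grade": ["N", "A", "B"]})

# Flat mapping: replaces matching values in every column.
everywhere = df.replace({"Y": "Yes", "N": "No"})

# Nested mapping: the outer key names the column to search.
only_answer = df.replace({"answer": {"Y": "Yes", "N": "No"}})

print(everywhere)
print(only_answer)
```

With the default settings, string matches are on the whole cell value, not on substrings.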

### 4.3 Sorting Data
You can sort a DataFrame by the values in one or more columns using the sort_values() method.

In [None]:
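# 'column_name' is a placeholder; use a real column such as 'column1', otherwise a KeyError is raised
df.sort_values(by=['column_name'], ascending=True, inplace=True)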
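
Sorting by several columns with a different direction for each is a common need. The following sketch uses invented `group` and `score` columns and also shows where missing values end up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score.
df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2.0, np.nan, 5.0, 1.0],
})

# Single column, descending; NaN is placed last by default.
by_score = df.sort_values(by="score", ascending=False)

# Two columns: group ascending, then score descending within each group.
# ignore_index=True renumbers the result 0..n-1.
multi = df.sort_values(
    by=["group", "score"],
    ascending=[True, False],
    ignore_index=True,
)

print(by_score)
print(multi)
```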

## 5. Reading and Writing Data from an Existing File

### 5.1 Reading from CSV/Excel

In [None]:
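# pd.read_excel() needs the openpyxl package to open .xlsx files.
# Uncomment the next line if it is not installed:
# !pip install openpyxl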

In [None]:
file1 = "pre-course_survey.csv"
file2 = "pre-course_survey.xlsx"

df_csv = pd.read_csv(file1)
df_excel = pd.read_excel(file2)
df_csv.shape, df_excel.shape

In [None]:
df_csv

In [None]:
df_excel
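
Writing data back out mirrors reading it. The sketch below is a self-contained round trip that uses an in-memory buffer instead of the survey files, so the column names and rows are stand-ins, not the real survey contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for a survey file; not the real survey data.
csv_text = (
    "Timestamp,Python,Git\n"
    "2024-01-01 09:00:00,Yes,No\n"
    "2024-01-02 10:30:00,No,Yes\n"
)

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Timestamp"])

# index=False leaves the row labels out of the written file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text back gives the same table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), parse_dates=["Timestamp"])
print(round_trip)

# With a real file the calls would look like this (file names are placeholders):
# df.to_csv("output.csv", index=False)
# df.to_excel("output.xlsx", index=False)  # requires openpyxl
```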

### 5.2 Renaming Columns

In [None]:
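# Current (long) column names, in file order
old_cols = df_csv.columns
# Short replacement names; the order must match old_cols
new_cols = ["timestamp", "python", "datascience", "git", "jupyter", "reproducibility", "nlp", "socialmedia_data", "goals", "skills", "concerns"]
# Pair each old name with its new name: {old: new, ...}
cols_dict = dict(zip(old_cols, new_cols))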

In [None]:
df_csv.rename(columns=cols_dict, inplace=True)

In [None]:
df_csv
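
One thing to watch with the `zip()` approach is that `zip()` stops at the shorter input, so a mismatch in length would leave some columns unrenamed without any error. A minimal sketch with a guard, using invented long headers in place of the real survey questions:

```python
import pandas as pd

# Hypothetical long headers standing in for the real survey questions.
df = pd.DataFrame({
    "Timestamp": ["2024-01-01"],
    "How familiar are you with Python?": [3],
    "How familiar are you with Git?": [1],
})
new_cols = ["timestamp", "python", "git"]

# Guard against a silent partial rename.
if len(new_cols) != len(df.columns):
    raise ValueError(
        f"expected {len(df.columns)} new names, got {len(new_cols)}"
    )

cols_dict = dict(zip(df.columns, new_cols))
renamed = df.rename(columns=cols_dict)
print(renamed.columns.tolist())
```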

### 5.3 Describing Data

In [None]:
df_csv.describe()

In [None]:
df_csv["python"].value_counts()

In [None]:
df_csv

In [None]:
df_csv.columns
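
By default `describe()` summarizes only numeric columns, which matters for a survey that is mostly text. The sketch below illustrates this on a tiny made-up table that borrows two column names from the renamed survey; the values are invented.

```python
import pandas as pd

# Invented values; only the column names echo the survey.
survey = pd.DataFrame({
    "python": ["Yes", "No", "Yes", "Yes"],
    "reproducibility": [5, 3, 4, 3],
})

# Numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = survey.describe()

# include="all" adds count, unique, top, and freq for text columns.
full_summary = survey.describe(include="all")

# Frequency of each distinct value, as counts and as proportions.
counts = survey["python"].value_counts()
shares = survey["python"].value_counts(normalize=True)

print(numeric_summary)
print(counts)
```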

### Exercise 1

1. Find the columns whose values are "Yes" or "No".
2. Convert "Yes"/"No" to 1/0 in all of those columns.
3. Count how many 1s and 0s appear in each column.

### Exercise 2

1. Identify the row(s) where the participant's "reproducibility" == 5.
2. Find the "goals", "skills", and "concerns" for that participant. Hint: `df.loc`