# Pandas for Data Science

This notebook collects worked examples of the pandas library for data science, starting with a brief look at NumPy arrays.


In [None]:
# Import necessary library
import numpy as np
import pandas as pd
from io import StringIO

## NumPy


`np.ndarray` is the array type itself, while `np.array` is a function that builds a new NumPy array and returns it as an instance of `np.ndarray`.

In practice, you'll almost always use `np.array` to create your arrays. The `np.ndarray` constructor is low-level: it takes a shape rather than data and leaves the memory uninitialized, so it does not offer conveniences such as creating an array from a list.
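
The difference between the factory function and the class can be checked directly. The following is a minimal sketch; the variable names `from_list` and `raw` are chosen here for illustration only.

```python
import numpy as np

# np.array is a factory function: it builds a new array from a Python list
# and returns an instance of the np.ndarray class.
from_list = np.array([1, 2, 3, 4, 5])
print(type(from_list))                    # <class 'numpy.ndarray'>
print(isinstance(from_list, np.ndarray))  # True

# Calling the np.ndarray constructor directly takes a shape, not data.
# The buffer is uninitialized, so the values it holds are arbitrary.
raw = np.ndarray(shape=(2, 3), dtype=np.float64)
print(raw.shape)  # (2, 3)
```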


In [None]:
# np.array
list_1 = [1, 2, 3, 4, 5]
first_array = np.array(list_1)
print(first_array)

## Pandas


- **Series**: A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. You can think of a Series as a single column of data in a table.
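
As a small illustration of how the index labels work, the sketch below builds a Series from made-up data (the names and scores are hypothetical) and reads values by label and by position.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a Series with string labels as its index.
scores = pd.Series([88.5, 92.0, np.nan], index=["alice", "bob", "carol"])

print(scores.index.tolist())  # ['alice', 'bob', 'carol']
print(scores["bob"])          # 92.0 (lookup by label)
print(scores.iloc[0])         # 88.5 (lookup by position)
print(scores.isna().sum())    # 1 (one missing value)
```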


In [None]:
# Series
ages = np.array([15, 24, np.nan, 35])
series_1 = pd.Series(ages)
series_1

In [None]:
# Series with custom index labels
series_2 = pd.Series(ages, index=["Emma", "Brown", "Jake", "Drake"])
series_2

- **DataFrame**: A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
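
The "dict of Series" view can be made concrete with a short sketch. The data and the names `height`, `city`, and `people` are hypothetical; the point is that each Series becomes a column and rows are aligned on the index labels.

```python
import pandas as pd

# Hypothetical example data: two Series that share only some labels.
height = pd.Series([170, 182], index=["emma", "jake"])
city = pd.Series(["Oslo", "Lima", "Rome"], index=["emma", "jake", "drake"])

# Each Series becomes a column; rows are aligned on the index labels.
people = pd.DataFrame({"height_cm": height, "city": city})

print(people.shape)                      # (3, 2)
print(people.loc["emma", "city"])        # Oslo
# "drake" has no height, so that cell is filled with NaN
# and the height_cm column is stored as float64.
print(people.loc["drake", "height_cm"])  # nan
```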


In [None]:
# Build a DataFrame from a dictionary: the keys are the column labels and the values
# are the column values. Scalar values are repeated for every row.
df = pd.DataFrame(
    {
        "datetime": pd.Timestamp("20130102"),
        "random_list": pd.Series(1, index=list(range(4)), dtype="float32"),
        "random_array": np.array([3] * 4, dtype="float32"),
        "random_categorical": pd.Categorical(["test", "train", "test", "train"]),
        "single_string": "gugu",
    }
)

df

- Set index: `df.set_index('column_name')`
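
Without `inplace=True`, `set_index` returns a new DataFrame and leaves the original unchanged. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data.
staff = pd.DataFrame({"name": ["Ann", "Bo"], "age": [31, 45]})

# set_index returns a new DataFrame; "name" moves from the columns to the index.
by_name = staff.set_index("name")

print(staff.columns.tolist())    # ['name', 'age']
print(by_name.columns.tolist())  # ['age']
print(by_name.loc["Bo", "age"])  # 45
```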


In [None]:
# Set index
df.set_index("random_array", inplace=True)
df

- Date range, daily between two dates: `pd.date_range(start='2019-01-01', end='2019-01-31', freq='D')`
- Date range, daily for 31 periods: `pd.date_range(start='2019-01-01', periods=31, freq='D')`
- Date range, month end: `pd.date_range(start='2019-01-01', periods=12, freq='M')` (pandas 2.2 and later prefer `'ME'`)
- Date range, month start: `pd.date_range(start='2019-01-01', periods=12, freq='MS')`
- Date range, quarter start: `pd.date_range(start='2019-01-01', periods=12, freq='QS')`
- Date range, quarter end: `pd.date_range(start='2019-01-01', periods=12, freq='Q')` (pandas 2.2 and later prefer `'QE'`)
- Date range, year start: `pd.date_range(start='2019-01-01', periods=12, freq='YS')`
- Date range, year end: `pd.date_range(start='2019-01-01', periods=12, freq='Y')` (pandas 2.2 and later prefer `'YE'`)
- Date range, year start (older alias): `pd.date_range(start='2019-01-01', periods=12, freq='AS')` (pandas 2.2 and later prefer `'YS'`)
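
A short sketch of what a few of these calls return. It uses only the `'D'`, `'MS'`, and `'QS'` aliases; the variable names are illustrative.

```python
import pandas as pd

# start + end and start + periods describe the same 31 daily timestamps.
by_end = pd.date_range(start="2019-01-01", end="2019-01-31", freq="D")
by_periods = pd.date_range(start="2019-01-01", periods=31, freq="D")
print(by_end.equals(by_periods))  # True

# "MS" anchors each timestamp to the first day of a month.
month_starts = pd.date_range(start="2019-01-01", periods=12, freq="MS")
print(month_starts[0].date(), month_starts[-1].date())  # 2019-01-01 2019-12-01

# "QS" steps one quarter at a time.
quarter_starts = pd.date_range(start="2019-01-01", periods=4, freq="QS")
print([str(ts.date()) for ts in quarter_starts])
# ['2019-01-01', '2019-04-01', '2019-07-01', '2019-10-01']
```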


In [None]:
# Create a month-end DatetimeIndex to use as a DataFrame index
# (pandas 2.2 and later prefer the alias "ME" over "M")
dates = pd.date_range("20200101", periods=6, freq="M")
dates

In [None]:
# Create a DataFrame from a 6x4 array of random floats
# Index it with the dates, and name the columns as "A", "B", "C", "D"
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
df

In [None]:
# Create a DataFrame from a 6x3 array of random floats
# Index it with the dates and name the columns "num1", "num2", "num3"
df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=["num1", "num2", "num3"])
df

In [None]:
# Drop the date index and replace it with the default integer index
df.reset_index(drop=True, inplace=True)
df

In [None]:
df.dtypes

In [None]:
# Create a DataFrame
df = pd.DataFrame({"B": [1, 2, 3], "A": [4, 5, 6], "D": [7, 8, 9], "C": [10, 11, 12]})
df

In [None]:
# Sort DataFrame by column labels
df.sort_index(axis=1, inplace=True)
df

In [None]:
# Sort DataFrame by column labels in descending order
df.sort_index(axis=1, inplace=True, ascending=False)
df

In [None]:
df["D"]

In [None]:
df.D

- Selection by label: `df.loc['row_name']`
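
One detail worth seeing in isolation is that label slices with `.loc` include both endpoints. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data with string row labels.
prices = pd.DataFrame(
    {"open": [10, 11, 12, 13], "close": [11, 12, 13, 14]},
    index=["mon", "tue", "wed", "thu"],
)

# Label slices include BOTH endpoints.
print(prices.loc["mon":"wed"].index.tolist())  # ['mon', 'tue', 'wed']

# A row label plus a column label returns a single value.
print(prices.loc["tue", "close"])  # 12
```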


In [None]:
# Create a DataFrame from a 6x3 array of random floats
# Index it with the dates and name the columns "num1", "num2", "num3"
df = pd.DataFrame(np.random.randn(6, 3), index=dates, columns=["num1", "num2", "num3"])
df

In [None]:
df.loc[dates[0:3]]

In [None]:
df.loc[:, ["num1", "num2"]]

- Selection by position: `df.iloc[0]`
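
Positional slices with `.iloc` follow ordinary Python slicing, so the end position is excluded. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical example data with string row labels.
prices = pd.DataFrame(
    {"open": [10, 11, 12, 13], "close": [11, 12, 13, 14]},
    index=["mon", "tue", "wed", "thu"],
)

# Positional slices exclude the end position, like Python lists.
print(prices.iloc[0:2].index.tolist())  # ['mon', 'tue']

# A single integer returns that row as a Series.
first_row = prices.iloc[0]
print(first_row["open"])  # 10

# Negative positions count from the end: last row, first column.
print(prices.iloc[-1, 0])  # 13
```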


In [None]:
df

In [None]:
df.iloc[2]

In [None]:
df.iloc[0:3, 0:2]

In [None]:
# Save the DataFrame to a CSV file (the ./data directory must already exist)
df.to_csv("./data/table.csv")

In [None]:
# Load the CSV back into a new DataFrame; the saved index is read back as an unnamed column
df1 = pd.read_csv("./data/table.csv")
df1

In [None]:
df2 = pd.DataFrame(
    [
        ["January", 100, 100, 23, 100],
        ["February", 51, 45, 145, 45],
        ["March", 81, 96, 65, 96],
        ["April", 80, 80, 54, 180],
        ["May", 51, 54, 54, 154],
        ["June", 112, 109, 79, 129],
    ],
    columns=["month", "clinic_east", "clinic_north", "clinic_south", "clinic_west"],
)

In [None]:
df2[df2.month == "April"]

In [None]:
march_april = df2[(df2.month == "March") | (df2.month == "April")]
march_april

In [None]:
# Load the orders; read_csv already returns a DataFrame, so wrapping it again is optional
orders = pd.read_csv("./data/shoefly.csv")
df3 = pd.DataFrame(orders)
emails = df3.email
frances_palmer = df3[df3.first_name == "Frances"]
comfy_shoes = df3[df3.shoe_type.isin(["clogs", "boots", "ballet flats"])]

print(orders.head(5))
print(emails)
print(frances_palmer)
print(comfy_shoes)

In [None]:
# Compare str.format with an f-string; both calls print the same sentence
first = 1
second = 2
third = 3

print(
    "This is the first number {one}, the second number: {two}, and the third number: {three}".format(
        one=first, two=second, three=third
    )
)

print(
    f"This is the first number {first}, the second number: {second}, and the third number: {third}"
)

In [None]:
df3.head(10)

In [None]:
# Build a description column by applying a function to each row (axis=1)
df3["description"] = df3.apply(
    lambda row: f"{row.first_name} {row.last_name} bought {row.shoe_color} {row.shoe_type} from material {row.shoe_material}",
    axis=1,
)
df3.head(10)

In [None]:
text = "id,	name,	hourly_wage,	hours_worked,\
10310,	Lauren Durham,	19,	43,\
18656,	Grace Sellers,	17,	40,\
61254,	Shirley Rasmussen,	16,	30,\
16886,	Brian Rojas,	18,	47,\
89010,	Samantha Mosley,	11,	38,\
87246,	Louis Guzman,	14,	39,\
20578,	Denise Mcclure,	15,	40,\
12869,	James Raymond,	15,	32,\
53461,	Noah Collier,	18,	35,\
14746,	Donna Frederick,	20,	41,\
171127,	Shirley Beck,	14,	32,\
192522,	Christina Kelly,	8,	44,\
122447,	Brian Noble,	11,	39,\
161654,	Randy Key,	16,	38,\
116988,	Diana Stewart,	14,	48,\
168619,	Timothy Sosa,	14,	42,\
159949,	Betty Skinner,	11,	48,\
181418,	Janet Maxwell,	12,	38,\
127267,	Madison Johnston,	20,	37,\
119985,	Virginia Nichols,	13,	49,"

In [None]:
# Drop the trailing comma so that split(",") does not produce an empty last field
cleaned_text = text.rstrip(",")

print(cleaned_text)

In [None]:
splitted = cleaned_text.split(",")
print(splitted)

In [None]:
# Strip the surrounding whitespace (the leading '\t') from each field
data = [item.strip() for item in splitted]
print(data)

In [None]:
# Group every four elements together
rows = [data[i : i + 4] for i in range(0, len(data), 4)]
print(rows)

In [None]:
# Join the elements of each row with a comma
csv_rows = [",".join(row) for row in rows]
print(csv_rows)

In [None]:
# Join all the rows with a newline
csv_data = "\n".join(csv_rows)
print(csv_data)

In [None]:
# Convert the string csv data to a dataframe
df_csv = pd.read_csv(StringIO(csv_data))

In [None]:
# View the DataFrame
df_csv.head(10)

In [None]:
# Write the DataFrame into csv
df_csv.to_csv("./data/worker.csv", index=False)

Passing `index=False` to `to_csv` tells pandas not to write the DataFrame's index to the CSV file.

By default, `to_csv` writes the index as the first column of the file. If the index is not meaningful and you don't want it in the file, set `index=False`.

For example, if your DataFrame `df` looks like this:

|     | A   | B   |
| --- | --- | --- |
| 0   | 1   | 2   |
| 1   | 3   | 4   |
| 2   | 5   | 6   |

Calling `df.to_csv('file.csv')` will produce a CSV file like this:

```
,A,B
0,1,2
1,3,4
2,5,6
```

But if you call `df.to_csv('file.csv', index=False)`, the CSV file will look like this:

```
A,B
1,2
3,4
5,6
```

As you can see, the index column (0, 1, 2) is not written into the CSV file when `index=False`.
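
The same comparison can be run without touching the disk. The sketch below relies on `to_csv` returning the CSV text as a string when no path is given, and reads it back through `StringIO`; the variable names are illustrative.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the first version back turns the saved index into an ordinary column.
print(pd.read_csv(StringIO(with_index)).columns.tolist())     # ['Unnamed: 0', 'A', 'B']
print(pd.read_csv(StringIO(without_index)).columns.tolist())  # ['A', 'B']

# index_col=0 tells read_csv to use that first column as the index again.
restored = pd.read_csv(StringIO(with_index), index_col=0)
print(restored.columns.tolist())  # ['A', 'B']
```

If the index does carry information, such as dates, keeping it in the file and restoring it with `index_col=0` is usually the safer choice.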


In [None]:
# Second challenge
text2 = "id,	name,	hourly_wage,	hours_worked,\
10310,	Lauren Durham,	19,	43,\
18656,	Grace Sellers,	17,	40,\
61254,	Shirley Rasmussen,	16,	30,\
16886,	Brian Rojas,	18,	47,\
89010,	Samantha Mosley,	11,	38,\
87246,	Louis Guzman,	14,	39,\
20578,	Denise Mcclure,	15,	40,\
12869,	James Raymond,	15,	32,\
53461,	Noah Collier,	18,	35,\
14746,	Donna Frederick,	20,	41,\
171127,	Shirley Beck,	14,	32,\
192522,	Christina Kelly,	8,	44,\
122447,	Brian Noble,	11,	39,\
161654,	Randy Key,	16,	38,\
116988,	Diana Stewart,	14,	48,\
168619,	Timothy Sosa,	14,	42,\
159949,	Betty Skinner,	11,	48,\
181418,	Janet Maxwell,	12,	38,\
127267,	Madison Johnston,	20,	37,\
119985,	Virginia Nichols,	13,	49,"

In [None]:
df_work = pd.read_csv("./data/worker.csv")
df_work.head(10)

In [None]:
# Take the last whitespace-separated token of the full name as the last name
get_last_name = lambda name: name.split()[-1]

df_work["last_name"] = df_work.name.apply(get_last_name)

In [None]:
df_work

In [None]:
# Pay the regular wage for the first 40 hours and 1.5x the wage for any hours beyond 40
total_earned = (
    lambda row: (40 * row.hourly_wage)
    + (row.hours_worked - 40) * (1.5 * row.hourly_wage)
    if row.hours_worked > 40
    else (row.hours_worked * row.hourly_wage)
)

In [None]:
df_work["total_earned"] = df_work.apply(total_earned, axis=1)
df_work
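
The row-wise `apply` above can also be written with column arithmetic, which avoids calling a Python function for every row. This is a sketch of one alternative, not the notebook's method; it uses three rows copied from the worker table, and `regular_hours` and `overtime_hours` are names introduced here.

```python
import pandas as pd

# Three rows copied from the worker table.
workers = pd.DataFrame(
    {
        "name": ["Lauren Durham", "Grace Sellers", "Shirley Rasmussen"],
        "hourly_wage": [19, 17, 16],
        "hours_worked": [43, 40, 30],
    }
)

# Split the hours into the part paid at the regular rate and the overtime part.
regular_hours = workers.hours_worked.clip(upper=40)
overtime_hours = (workers.hours_worked - 40).clip(lower=0)

# Regular hours at the base wage, overtime hours at 1.5x the base wage.
workers["total_earned"] = (
    regular_hours * workers.hourly_wage
    + overtime_hours * 1.5 * workers.hourly_wage
)
print(workers.total_earned.tolist())  # [845.5, 680.0, 480.0]
```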