Chapter 5 of "Python for Data Analysis" introduces the pandas library and its fundamental data structures, along with essential functionalities for data manipulation. 

## Introduction to pandas Data Structures:
* Series: A *one-dimensional* labeled array capable of holding any data type. Series can be created from lists, NumPy arrays, or dictionaries, and their labels (the index) allow for easy data access and alignment.

* DataFrame: A *two-dimensional* table-like structure with labeled rows and columns, capable of holding heterogeneous data. DataFrames can be created from dictionaries of lists or NumPy arrays, and are a fundamental structure for data analysis in pandas. While physically two-dimensional, they can also represent higher-dimensional data using hierarchical indexing.

* Index Objects: The labels for the rows and columns in pandas, which are essential for data alignment and access. Index objects are immutable and have methods and properties for set logic, which help answer questions about the data they contain.
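
To make the three structures above concrete, here is a minimal sketch. The fruit names and numbers are placeholders I made up for illustration; they do not come from the book.

```python
import pandas as pd

# Hypothetical data: a Series built from a dictionary (keys become the index).
prices = pd.Series({"apple": 1.2, "pear": 0.8, "plum": 2.5})
apple_price = prices["apple"]  # label-based access

# A DataFrame built from a dictionary of equal-length lists.
fruit_df = pd.DataFrame(
    {"price": [1.2, 0.8, 2.5], "stock": [10, 0, 4]},
    index=["apple", "pear", "plum"],
)

# Index objects are immutable: item assignment raises TypeError.
try:
    fruit_df.index[0] = "kiwi"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

# Set-like logic on Index objects.
has_pear = "pear" in prices.index
shared = prices.index.intersection(pd.Index(["pear", "plum", "fig"]))
```

Here `shared` holds only the labels present in both indexes, which is the kind of question the set-logic methods are meant to answer.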

## Exercise
* Concept Question: What is the primary difference between a pandas Series and a pandas DataFrame, and in what scenarios would you prefer one over the other?
  * A pandas Series is 1D, and a pandas DataFrame is 2D.
  * Use Series when working with a single variable or column of data, and prefer a DataFrame when managing multiple related variables or columns.

* Coding Question 1: Create a pandas Series named my_series with the following data and labels: Data:, Labels: ['a', 'b', 'c', 'd', 'e']. Then, access the element with the label 'c'.

* Coding Question 2: Create a pandas DataFrame named my_df from the following dictionary: {'col1':, 'col2':, 'col3':}. Then, add a new column named col4 with the values.


In [1]:
import pandas as pd

In [8]:
my_series = pd.Series(data=[10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
my_df = pd.DataFrame(data={"name": ["Rex", "Dita"], "age": [24, 21], "gender": ["male", "female"]})

#### Lesson Learned
##### KeyError: 0
* Explanation: my_df[0] raises KeyError: 0 because pandas looks for a column labeled 0, and no such column exists. Indexing a DataFrame with a single label in square brackets selects a column, so accessing rows requires dedicated methods.
* Solution:
  * .loc[] for *label-based indexing*: Use .loc to access rows by their *index labels*. If the DataFrame has the default integer index, my_df.loc[0] returns the row whose index label is 0, which is the first row.
  * .iloc[] for *position-based indexing*: Use .iloc to access rows by their *integer position*, regardless of the labels. my_df.iloc[0] returns the first row (position 0).
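
The lesson above can be reproduced in a few lines. This is a sketch on a small stand-in DataFrame (the name `people` and the `set_index` step are my additions, not part of the original cells).

```python
import pandas as pd

# Stand-in DataFrame with the default integer index (0, 1).
people = pd.DataFrame({"name": ["Rex", "Dita"], "age": [24, 21]})

# people[0] looks for a COLUMN labeled 0, which does not exist.
try:
    people[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# With the default index, label 0 and position 0 are the same row.
first_by_label = people.loc[0]
first_by_position = people.iloc[0]

# With a non-integer index the two accessors differ in what they accept:
# .loc takes labels, .iloc takes positions.
relabelled = people.set_index("name")
dita_age = relabelled.loc["Dita", "age"]   # by label
first_age = relabelled.iloc[0]["age"]      # by position (Rex's row)
```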


In [24]:
# Access to the data

# my_series['c']
# my_df.iloc[0]

# Add more data to a Series
my_series_2 = pd.Series([0, -10, -20], index=["f", "g", "h"])
combined_series = pd.concat([my_series, my_series_2])
print(combined_series)

# Add more data to a DataFrame
address = ["Utrecht", "Den Haag"]
my_df["address"] = address

a    10
b    20
c    30
d    40
e    50
f     0
g   -10
h   -20
dtype: int64


## Essential Functionality:
* Reindexing: Creating a new object with its data conformed to a new index. *This can reorder the data, introduce missing values (NaN) for labels absent from the original index, or drop labels that are not in the new index*.
* Dropping Entries from an Axis: Removing rows or columns from a Series or DataFrame.
* Indexing, Selection, and Filtering: Selecting subsets of data from Series or DataFrames using labels, positions, or boolean arrays. This includes using the loc and iloc operators.
* Integer Indexes: Understanding how integer indexes are handled in pandas, which can differ from standard Python indexing.
* Arithmetic and Data Alignment: Performing arithmetic between Series or DataFrames. pandas automatically aligns the operands on their labels, and methods such as add accept a fill_value for labels that appear in only one operand.
* Function Application and Mapping: Applying functions to data in Series or DataFrames using methods like apply and map.
* Sorting and Ranking: Sorting data in Series or DataFrames by index or column values and assigning ranks to values.
* Axis Indexes with Duplicate Labels: Understanding how pandas handles data with duplicate index labels.
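
A few items in this list are easier to see in code. The sketch below touches integer indexes, boolean filtering, and ranking; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical Series whose integer labels are deliberately out of order,
# so that "label 0" and "position 0" refer to different elements.
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = ser.loc[0]      # element whose label is 0
by_position = ser.iloc[0]  # element in the first position

# Filtering with a boolean array keeps only the rows where the mask is True.
large = ser[ser > 15]

# rank() assigns ranks starting at 1; ties share the mean of their ranks.
ranks = pd.Series([7, -5, 7, 4]).rank()
```

Using .loc and .iloc explicitly avoids having to remember how plain square brackets treat integer keys.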

## Exercise

* Concept Question: Explain the purpose of reindexing in pandas. Provide an example where reindexing is necessary.
* Coding Question 1: Given a pandas Series named sales with index ['Mon', 'Wed', 'Fri'], reindex it to ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] and fill any missing values with 0.
* Coding Question 2: Create a pandas DataFrame named df with columns named 'A', 'B', and 'C'. Drop the column labeled 'B'.

* Concept Question: How does pandas handle arithmetic operations when performing addition between two Series with different index labels? What about when operating on a DataFrame?
* Coding Question 1: Given two pandas Series, s1 with index ['a', 'b', 'c'] and values and s2 with index ['b', 'c', 'd'] and values, add them together, and replace missing values with 0.
* Coding Question 2: Create two pandas DataFrames with some different index labels, named df1 and df2, then add them together.
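
One possible answer to the alignment questions above, as a minimal sketch. The values in s1, s2, df1, and df2 are my own placeholders, since the exercise text does not show them.

```python
import pandas as pd

# Placeholder values (assumption): the exercise does not list them.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The + operator aligns on labels; a label missing on either side gives NaN.
plain_sum = s1 + s2

# add() with fill_value=0 treats the missing side as 0 instead.
filled_sum = s1.add(s2, fill_value=0)

# DataFrames align on both the row index and the column labels.
df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
df2 = pd.DataFrame({"x": [10, 20], "z": [30, 40]}, index=["r2", "r3"])
df_sum = df1 + df2  # only the cell (r2, x) exists in both operands
```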

* Concept Question: Describe the difference between the apply and map methods for pandas Series, and where you might use them in data processing.
* Coding Question 1: Given a pandas Series named series, apply a lambda function to each element that multiplies it by 2 using the apply method.
* Coding Question 2: Given the pandas series named series, create a mapping using a dictionary such that every 1 is replaced by 'a', 2 by 'b', 3 by 'c'. Use this mapping to transform the series using the map method.
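
A sketch of how these two questions might be answered. The contents of `series` are assumed, because the exercise does not specify them.

```python
import pandas as pd

# Assumed contents for the exercise's `series`.
series = pd.Series([1, 2, 3, 2])

# apply calls the function once per element.
doubled = series.apply(lambda x: x * 2)

# map with a dictionary looks each value up as a key.
letters = series.map({1: "a", 2: "b", 3: "c"})

# Values that are missing from the dictionary become NaN.
partial = series.map({1: "a"})
```

For simple arithmetic like doubling, the vectorized form `series * 2` gives the same result and is usually preferable; apply is more useful when the function cannot be expressed as a vectorized operation.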

* Concept Question: What is the difference between sorting by index labels versus sorting by column values in a DataFrame?
* Coding Question 1: Given a DataFrame df, sort it by its index labels in descending order.
* Coding Question 2: Given the same DataFrame df, sort it by the values in column 'A' in ascending order, while handling any missing values by putting them at the end.
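
A possible answer to the sorting questions, sketched on a small made-up DataFrame (the exercise only says "a DataFrame df", so the data here is an assumption).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in column 'A'.
df = pd.DataFrame(
    {"A": [3.0, np.nan, 1.0], "B": ["x", "y", "z"]},
    index=["b", "c", "a"],
)

# Sort by the index labels, in descending order.
by_index_desc = df.sort_index(ascending=False)

# Sort by the values in column 'A', ascending, with NaN placed last.
by_a = df.sort_values(by="A", ascending=True, na_position="last")
```

sort_index orders rows by their labels and ignores the data, while sort_values orders rows by the contents of one or more columns and carries the labels along.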

* Concept Question: What challenges arise when dealing with duplicate labels in a DataFrame's index, and what tools does pandas provide for dealing with them?
* Coding Question 1: Create a DataFrame df with duplicate index labels: pd.DataFrame({'A':}, index = ['a', 'a', 'b', 'b']). Then, select the rows with label 'a'.
* Coding Question 2: How would you get the sum of values for each unique index, given the previous DataFrame?
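
One way to approach the duplicate-label questions, as a sketch. The values in column 'A' are placeholders, since the exercise text leaves them out.

```python
import pandas as pd

# Placeholder values (assumption) for column 'A'.
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=["a", "a", "b", "b"])

# is_unique reports whether any label is repeated.
is_unique = df.index.is_unique

# Selecting a duplicated label returns every matching row as a DataFrame.
rows_a = df.loc["a"]

# Grouping on the index level aggregates the rows that share a label.
sums = df.groupby(level=0).sum()
```

The main thing to watch for is that `df.loc[label]` changes shape depending on whether the label is duplicated, so code that expects a single row can break silently.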


In [33]:
# Reindexing for Series
sales = pd.Series([200, 300, 400], index=["Mon", "Wed", "Fri"])
new_index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
reindexed_sales = sales.reindex(new_index, fill_value = 0)

# Reindexing for DataFrame (New index for rows; New index for columns)
data = {
    "Product": ["A", "B", "C"],
    "Price": [50, 30, 20],
    "Quantity": [100, 150, 200]
}
df = pd.DataFrame(data, index=["Store1", "Store2", "Store3"])
print("Original DataFrame:\n", df)

new_index = ["Store1", "Store2", "Store3", "Store4"]
reindexed_df_rows = df.reindex(new_index, fill_value=0)
print("Reindexed DataFrame (Rows):\n", reindexed_df_rows)


new_columns = ["Product", "Price", "Quantity", "Discount"]
reindexed_df_cols = df.reindex(columns=new_columns, fill_value=0)
print("Reindexed DataFrame (Columns):\n", reindexed_df_cols)

Original DataFrame:
        Product  Price  Quantity
Store1       A     50       100
Store2       B     30       150
Store3       C     20       200
Reindexed DataFrame (Rows):
        Product  Price  Quantity
Store1       A     50       100
Store2       B     30       150
Store3       C     20       200
Store4       0      0         0
Reindexed DataFrame (Columns):
        Product  Price  Quantity  Discount
Store1       A     50       100         0
Store2       B     30       150         0
Store3       C     20       200         0


In [38]:
# Remove a row from a series


# Remove a column from a dataframe

songs_data = {
    "Song Title": ["Shape of You", "Blinding Lights", "Bad Guy", "Rolling in the Deep", "Someone Like You"],
    "Singer": ["Ed Sheeran", "The Weeknd", "Billie Eilish", "Adele", "Adele"],
    "Year": [2017, 2019, 2019, 2011, 2011],
    "Genre": ["Pop", "Synthpop", "Electropop", "Soul", "Soul"]
}

songs_df = pd.DataFrame(songs_data)

# drop returns a new DataFrame; songs_df itself keeps every column
songs_df.drop(["Genre", "Year"], axis=1)

# Remove a row from a dataframe

# First, find the index labels of the rows to remove
ed_sheeran_index = songs_df[songs_df["Singer"] == "Ed Sheeran"].index

# Then drop those labels (again returning a new DataFrame)
songs_df.drop(ed_sheeran_index)

            Song Title         Singer  Year       Genre
1      Blinding Lights     The Weeknd  2019    Synthpop
2              Bad Guy  Billie Eilish  2019  Electropop
3  Rolling in the Deep          Adele  2011        Soul
4     Someone Like You          Adele  2011        Soul
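
The "Remove a row from a series" comment in the cell above has no code under it. A minimal sketch of what could go there, reusing the my_series values from earlier in these notes (the variable names `without_c` and `without_a_b` are mine):

```python
import pandas as pd

my_series = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Drop a single label; the result is a new Series.
without_c = my_series.drop("c")

# Drop several labels at once by passing a list.
without_a_b = my_series.drop(["a", "b"])

# The original Series is unchanged unless the result is assigned back.
original_length = len(my_series)
```

As with the DataFrame examples, drop does not modify the object it is called on, so the result has to be assigned to a name to be kept.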
