<a href="https://colab.research.google.com/github/whitfieldscott/4GeeksAcademy/blob/master/03-pandas/03.1-Intro-To-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Pandas logo](https://github.com/4GeeksAcademy/machine-learning-prework/blob/main/03-pandas/assets/pandas_logo.png?raw=true)

## Introduction to Python Pandas

**Pandas** is an open-source Python library that provides data structures and functions for handling and analyzing tabular data. It is built on top of NumPy, which lets it integrate well into the data science ecosystem alongside other libraries such as `Scikit-learn` and `Matplotlib`.

Specifically, the key points of this library are:

- **Data structures**: The library provides two structures for working with data. A `Series` is a labeled one-dimensional array, similar to a vector, list or sequence, that can hold any type of data. A `DataFrame` is a labeled two-dimensional structure whose columns can be of different types, similar to a spreadsheet or a SQL table.
- **Data manipulation**: Pandas lets you carry out a thorough data analysis through functions applied directly to its data structures. These operations include handling missing data, filtering, and merging, combining and joining data from different sources.
- **Efficiency**: Most built-in operations and functions on these data structures are vectorized, which improves performance compared to traditional Python loops and iterators.
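
As a minimal sketch of what "vectorized" means in the last point, the following compares an explicit Python loop with a single expression over the whole structure. The variable names and values are invented for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])  # hypothetical example data

# Loop version: handles one element at a time in Python
looped = pd.Series([v * 2 for v in values])

# Vectorized version: one expression applied to the whole Series
vectorized = values * 2

print(vectorized.equals(looped))
```

Both versions hold the same values; the vectorized form is shorter and avoids the per-element Python loop.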

Pandas is a fundamental tool for any developer working with data in Python, as it provides a wide variety of tools for data exploration, cleaning and transformation, making the analysis process more efficient and effective.
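
To make the data manipulation operations mentioned above more concrete, here is a small sketch that handles a missing value, filters rows and merges two tables. The tables, column names and values are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
sales = pd.DataFrame({"store": ["A", "B", "C"], "units": [5.0, np.nan, 12.0]})
cities = pd.DataFrame({"store": ["A", "B", "C"], "city": ["Miami", "Madrid", "Lima"]})

# Missing data: replace the missing value with 0
sales["units"] = sales["units"].fillna(0)

# Filtering: keep only the rows with more than 3 units
busy = sales[sales["units"] > 3]

# Merging: combine both tables on the shared "store" column
merged = busy.merge(cities, on="store")
print(merged)
```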

### Data Structures in Python Pandas

Pandas provides two main data structures: `Series` and `DataFrame`.

#### Series

A **series** in Pandas is a one-dimensional labeled data structure. It is similar to a 1D array in NumPy, but has an index that allows access to the values by label. A series can contain any kind of data: integers, strings, Python objects, and so on.

![Example of a series](https://github.com/4GeeksAcademy/machine-learning-prework/blob/main/03-pandas/assets/series.PNG?raw=true)

A Pandas series has two distinct parts:

- **Index** (*index*): An array of labels associated with the data.
- **Value** (*value*): An array of data.

A series can be created using the `Series` class of the library with a list of elements as an argument. For example:

In [5]:
import pandas as pd

serie = pd.Series([1, 2, 3, 4, 5])
serie

0    1
1    2
2    3
3    4
4    5
dtype: int64


This creates a series with the elements 1, 2, 3, 4 and 5. Since we did not provide any index, an automatic index starting at 0 is generated. To use custom labels instead, pass them with the `index` argument:

In [6]:
serie = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
serie

a    1
b    2
c    3
d    4
e    5
dtype: int64


Thus, the previous series has an index composed of letters.

Both series store the same values, but the way they are accessed may vary according to the index.
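
A short sketch of this idea: two series with the same values differ only in their index. The names `default_index` and `letter_index` are chosen here for illustration.

```python
import pandas as pd

default_index = pd.Series([1, 2, 3])
letter_index = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Same underlying data in both cases
print(default_index.values, letter_index.values)

# Different labels
print(default_index.index.tolist())
print(letter_index.index.tolist())
```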

In a series, its elements can be accessed by index or by position (the latter is what we did in NumPy). Below are some operations that can be performed using the above series:

In [19]:
# Access the third element
print(serie["c"]) # By index
print(serie.iloc[2]) # By position (serie[2] is deprecated when the index holds labels)

# Change the value of the second element
serie["b"] = 7
print(serie)

# Add 10 to all elements
serie += 10
print(serie)

# Calculate the sum of the elements
sum_all = serie.sum()
print(sum_all)

3
3
a    1
b    7
c    3
d    4
e    5
dtype: int64
a    11
b    17
c    13
d    14
e    15
dtype: int64
70




#### Pandas DataFrame

A **DataFrame** in Pandas is a two-dimensional labeled data structure. It is similar to a 2D array in NumPy, but has row and column indexes that allow access to the values by label.

![Example of a DataFrame](https://github.com/4GeeksAcademy/machine-learning-prework/blob/main/03-pandas/assets/dataframe.PNG?raw=true)

A DataFrame in Pandas has several differentiated parts:

- **Data** (*data*): An array of values that can be of different types per column.
- **Row index** (*row index*): An array of labels associated with the rows.
- **Column index** (*column index*): An array of labels associated with the columns.

A DataFrame can be seen as a set of series joined in a tabular structure, sharing a common row index, with each series identified by its own column label.

![Series and DataFrames](https://github.com/4GeeksAcademy/machine-learning-prework/blob/main/03-pandas/assets/series_dataframe.png?raw=true)
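
The relationship between series and DataFrames can be sketched as follows. The names `col_a`, `col_b` and `frame` are made up for this example.

```python
import pandas as pd

row_labels = ["a", "b", "c"]
col_a = pd.Series([1, 2, 3], index=row_labels)
col_b = pd.Series([4, 5, 6], index=row_labels)

# Each Series becomes a column; the shared labels become the row index
frame = pd.DataFrame({"col A": col_a, "col B": col_b})
print(frame)

# Selecting a column gives back a Series with the same row index
print(type(frame["col A"]).__name__)
print(frame["col A"].index.equals(col_a.index))
```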

A DataFrame can be created using the `DataFrame` class. For example:

In [36]:
dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index = ["a", "b", "c"])
dataframe

   0  1  2
a  1  2  3
b  4  5  6
c  7  8  9


This creates a DataFrame with three rows and three columns. As with series, a DataFrame generates automatic indexes for rows and columns if they are not passed as arguments to the constructor (here only the row labels were given, so the columns are numbered from 0 to 2). To create a DataFrame with specific labels for both rows and columns, you can write:

In [39]:
data = {
    "col A": [1, 2, 3],
    "col B": [4, 5, 6],
    "col C": [7, 8, 9]
}

dataframe = pd.DataFrame(data, index = ["a", "b", "c"])
dataframe

   col A  col B  col C
a      1      4      7
b      2      5      8
c      3      6      9


In this way, custom labels are provided for the columns (the keys of the dictionary) and for the rows (the `index` argument, as was the case with the series).
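
As a quick check of where each set of labels ends up, the sketch below inspects a small DataFrame built the same way. The name `table` is chosen here for illustration.

```python
import pandas as pd

table = pd.DataFrame({"col A": [1, 2, 3], "col B": [4, 5, 6]}, index=["a", "b", "c"])

print(table.columns.tolist())  # column labels, taken from the dictionary keys
print(table.index.tolist())    # row labels, taken from the index argument
print(table.shape)             # (number of rows, number of columns)
```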

In a DataFrame its elements can be accessed by index or by position. Below are some operations that can be performed using the above DataFrame:

In [70]:
# Access all the data in a column
print(dataframe["col A"]) # By index
print(dataframe.loc[:,"col A"]) # By index
print(dataframe.iloc[:,0]) # By position

# Access all the data in a row
print(dataframe.loc["a"]) # By index
print(dataframe.iloc[0]) # By position

# Access a specific element (row, column)
print(dataframe.loc["a", "col A"]) # By index
print(dataframe.iloc[0, 0]) # By position

# Create a new column
dataframe["col D"] = [10, 11, 12]
print(dataframe)

# Create a new row
dataframe.loc["d"] = [13, 14, 15, 16]
print(dataframe)

# Multiply by 10 the elements of a column
dataframe["col A"] *= 10
print(dataframe)

# Calculate the sum of each column
sum_all = dataframe.sum()
print(sum_all)

a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
a    1
b    2
c    3
Name: col A, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
col A    1
col B    4
col C    7
Name: a, dtype: int64
1
1
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
   col A  col B  col C  col D
a      1      4      7     10
b      2      5      8     11
c      3      6      9     12
d     13     14     15     16
   col A  col B  col C  col D
a     10      4      7     10
b     20      5      8     11
c     30      6      9     12
d    130     14     15     16
col A    190
col B     29
col C     39
col D     49
dtype: int64

### Functions in Python Pandas

Pandas provides a large number of predefined functions that can be applied to the data structures seen above. Some of the most used in data analysis are:

In [237]:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])

# Arithmetic Operations
print("Sum of series:\n\n", s1.add(s2))

print("Sum of DataFrames:\n\n", d1.add(d2))

Sum of series:

 0    5
1    7
2    9
dtype: int64
Sum of DataFrames:

     0   1   2
0   8  10  12
1  14  16  18


In [238]:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
d2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]])

# Arithmetic Operations
print("Sum of series:", s1.add(s2))
print("Sum of DataFrames:", d1.add(d2))

# Statistical Operations
# They can be applied in the same way to DataFrames
print("Mean:", s1.mean())
print("Median:", s1.median())
print("Number of elements:", s1.count())
print("Standard deviation:", s1.std())
print("Variance:", s1.var())
print("Maximum value:", s1.max())
print("Minimum value:", s1.min())
print("Correlation:", s1.corr(s2))
print("Statistic summary:", s1.describe())

Sum of series: 0    5
1    7
2    9
dtype: int64
Sum of DataFrames:     0   1   2
0   8  10  12
1  14  16  18
Mean: 2.0
Median: 2.0
Number of elements: 3
Standard deviation: 1.0
Variance: 1.0
Maximum value: 3
Minimum value: 1
Correlation: 1.0
Statistic summary: count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
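
As the comment in the cell above notes, the same statistical functions work on DataFrames. A minimal sketch, using a made-up DataFrame called `scores`, shows that they are computed per column by default and per row with `axis=1`:

```python
import pandas as pd

scores = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})  # hypothetical data

print(scores.mean())        # one mean per column
print(scores.mean(axis=1))  # one mean per row
print(scores.describe())    # summary table with one column per variable
```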


#### Pandas allows you to use custom Python functions (including lambda)

In addition to the predefined Pandas functions, we can define our own and apply them to the data structures. To do this, we write a function that receives a value (or a column or row in the case of a DataFrame) and returns a modified one, and pass it to `apply`.

`apply` also accepts **lambda expressions**, which declare a function anonymously.

The following shows how to apply functions to series:

In [82]:
import pandas as pd
s = pd.Series([1, 2, 3, 4])

# Explicit definition of the function
def squared(x):
    return x ** 2
s1 = s.apply(squared)
print(s1)

# Anonymous definition of the function
s2 = s.apply(lambda x: x ** 2)
print(s2)

0     1
1     4
2     9
3    16
dtype: int64
0     1
1     4
2     9
3    16
dtype: int64


The following shows how to apply functions to a DataFrame, which can be done by row, by column or by elements, similar to series:

In [92]:
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6]
})

# Apply function along a column
df["A"] = df["A"].apply(lambda x: x ** 2)
print(df)

# Apply function along a row
df.loc[0] = df.loc[0].apply(lambda x: x ** 2)
print(df)

# Apply function to all elements
# (DataFrame.map replaces applymap, which is deprecated since pandas 2.1)
df = df.map(lambda x: x ** 2)
print(df)

   A  B
0  1  4
1  4  5
2  9  6
   A   B
0  1  16
1  4   5
2  9   6
    A    B
0   1  256
1  16   25
2  81   36




`apply` is more flexible than the vectorized Pandas functions, but it can be much slower, especially on large data sets, because it calls the Python function once per element, row or column. It is worth checking the Pandas or NumPy built-in functions first, as they are usually more efficient than ones we implement ourselves.
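
For instance, squaring a series does not need `apply` at all. The following sketch, with illustrative variable names, shows that the built-in operator gives the same result:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4])

with_apply = numbers.apply(lambda x: x ** 2)  # calls the lambda once per element
vectorized = numbers ** 2                     # one operation over the whole Series

print(with_apply.equals(vectorized))
```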

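Also, the type of result `apply` returns depends on the function applied and on how `apply` is configured: on a DataFrame, for example, a function that returns a single value per column or row produces a series, while one that returns a series produces a DataFrame.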
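
A small sketch of how the result of `apply` on a DataFrame can change shape. The DataFrame `df_demo` and the lambdas are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# The function returns one value per column, so the result is a Series
col_totals = df_demo.apply(lambda col: col.sum())

# axis=1 passes each row instead; one value per row, so again a Series
row_totals = df_demo.apply(lambda row: row.sum(), axis=1)

# The function returns a Series per column, so the result is a DataFrame
doubled = df_demo.apply(lambda col: col * 2)

print(type(col_totals).__name__, type(row_totals).__name__, type(doubled).__name__)
```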

## Start practicing the Pandas syntax in Python right now!

> Click on Open in Colab to do the exercises

> 🛟 Solutions: In this link you can find the [solutions for the following pandas exercises](https://4geeks.com/lesson/pandas-exercises-and-solutions).



### Creation of Series and Pandas DataFrames

#### Pandas Exercise 01: Create a Series from a list, a NumPy array and a dictionary (★☆☆)

> NOTE: Review the class `pd.Series` (https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

In [93]:
import pandas as pd
import numpy as np

In [177]:
# from a list
my_list = [10, 20, 30, 40, 50]
series_list = pd.Series(my_list)
print(series_list)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [97]:
#from NumPy Array
my_array = np.array([10, 20, 30, 40, 50])
series_array = pd.Series(my_array)
print(series_array)

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [99]:
#from Dictionary
my_dict = {'a': 10, 'b': 20, 'c': 30, 'd': 40, 'e': 50}
series_dict = pd.Series(my_dict)
print(series_dict)

a    10
b    20
c    30
d    40
e    50
dtype: int64


#### Pandas Exercise 02: Create a DataFrame from a NumPy array, a dictionary and a list of tuples (★☆☆)

> NOTE: Review the class `pd.DataFrame` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [141]:
# NumPy array
array_data = np.array([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])

#converted to Dataframe
df_array = pd.DataFrame(array_data, columns=['Col A', 'Col B', 'Col C'], index=['R1', 'R2', 'R3'])
print("DataFrame from NumPy array:\n\n", df_array, "\n")


DataFrame from NumPy array:

     Col A  Col B  Col C
R1      1      2      3
R2      4      5      6
R3      7      8      9 



In [142]:
#from Dictionary
dict_data = {'Col A': [1, 4, 7],
             'Col B': [2, 5, 8],
             'Col C': [3, 6, 9]}

#convert to Dataframe
df_dict = pd.DataFrame(dict_data, index=['R1', 'R2', 'R3'])
print("DataFrame from Dictionary:\n\n", df_dict, "\n")

DataFrame from Dictionary:

     Col A  Col B  Col C
R1      1      2      3
R2      4      5      6
R3      7      8      9 



In [147]:
#list of Tuples
tuple_data = [('Col A', 1, 1.0),
              ('Col B', 2, 2.0),
              ('Col C', 3, 3.0)]

#convert to Dataframe
df_tuple = pd.DataFrame(tuple_data, columns=['Column', 'Integer', 'Float'], index=['R1', 'R2', 'R3'])
print("DataFrame from Tuple:\n\n", df_tuple, "\n")

DataFrame from Tuple:

    Column  Integer  Float
R1  Col A        1    1.0
R2  Col B        2    2.0
R3  Col C        3    3.0 



#### Pandas Exercise 03: Create 2 Series and use them to build a DataFrame (★☆☆)

> NOTE: Review the functions `pd.concat` (https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and `pd.Series.to_frame` (https://pandas.pydata.org/docs/reference/api/pandas.Series.to_frame.html)

In [221]:
#series 1
names = pd.Series(['Scott', 'Mike', 'Kody', 'Brandon'])
#series 2
ages = pd.Series([47, 49, 26, 30])

#Dataframe from series
df = pd.concat([names, ages], axis=1)
df.columns = ['Name', 'Age']
df.index = ['R1', 'R2', 'R3', 'R4']
print("Family Info:\n\n", df, "\n")

Family Info:

        Name  Age
R1    Scott   47
R2     Mike   49
R3     Kody   26
R4  Brandon   30 
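
The note for this exercise also mentions `pd.Series.to_frame`. One possible variant using it, assuming the series are given names at creation, could look like this:

```python
import pandas as pd

names = pd.Series(["Scott", "Mike", "Kody", "Brandon"], name="Name")
ages = pd.Series([47, 49, 26, 30], name="Age")

# to_frame turns the first Series into a one-column DataFrame named after it;
# assigning the second Series adds it as a new column, aligned on the index
family = names.to_frame()
family["Age"] = ages
print(family)
```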



### Filtering and updating

#### Pandas Exercise 04: Use the Series created in the previous exercise and select the positions of the elements of the first Series that are in the second Series (★★☆)

> NOTE: Review the function `pd.Series.isin` (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html)

In [227]:
# Made new series, since the last exercise used names and ages, which won't match
s1 = pd.Series([10, 20, 30, 40, 50]) #series 1
s2 = pd.Series([20, 40, 60]) #series 2


#positions in s1 that are in s2
elements_in_s1 = s1[s1.isin(s2)]
print("\nElements in 's1' that are also in 's2':\n\n", elements_in_s1)


Elements in 's1' that are also in 's2':

 1    20
3    40
dtype: int64
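
The output above shows the matching elements together with their index labels. If only the integer positions are wanted, one possible approach (a sketch, not the only way) is to convert the boolean mask with NumPy:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])
s2 = pd.Series([20, 40, 60])

# isin gives a boolean mask; flatnonzero returns the positions of the True entries
positions = np.flatnonzero(s1.isin(s2).to_numpy())
print(positions.tolist())
```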


#### Pandas Exercise 05: Use the series created in exercise 03 and list the elements that are not common between both series (★★☆)

In [230]:
# elements in s1 that are NOT in s2
elements_in_s1 = s1[~s1.isin(s2)]
print("\nElements in 's1' that are NOT in 's2':\n\n", elements_in_s1)


Elements in 's1' that are NOT in 's2':

 0    10
2    30
4    50
dtype: int64
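
The result above lists only the elements of `s1` that are missing from `s2`. If "not common" is read as covering both directions, a sketch along these lines would also pick up the elements of `s2` that are missing from `s1`:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])
s2 = pd.Series([20, 40, 60])

only_in_s1 = s1[~s1.isin(s2)]
only_in_s2 = s2[~s2.isin(s1)]

# Stack both results; ignore_index renumbers the combined Series from 0
not_common = pd.concat([only_in_s1, only_in_s2], ignore_index=True)
print(not_common.tolist())
```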


#### Pandas Exercise 06: Create a DataFrame of random numbers with 5 columns and 10 rows and sort one of its columns from smallest to largest (★★☆)

> NOTE: Review the function `pd.DataFrame.sort_values` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

In [264]:
#Create a DataFrame of random numbers with 5 columns and 10 rows
df = pd.DataFrame(np.random.randint(1, 101, size=(10, 5)),
                  columns=['Col A', 'Col B', 'Col C', 'Col D', 'Col E'], index=['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9', 'R10'])
print("Random number DataFrame:\n\n", df)

Random number DataFrame:

      Col A  Col B  Col C  Col D  Col E
R1      84     28     46     28     53
R2      65     84     63     52     70
R3      44     88     10      1     60
R4      22      1     61     91      4
R5      65     49     61     40     63
R6     100     90     70     39     40
R7      27     90     32     73    100
R8      50     26     57     12      6
R9      28     93     93     10     28
R10     15     57     77     39     75


In [300]:
#sort one of its columns from smallest to largest
df_sorted = df.sort_values(by='Col C', ascending=True)

print("DataFrame sort column C smallest to largest:\n\n", df_sorted)

DataFrame sort column C smallest to largest:

      Col A  Col B  Col C  Col D  Col E
R3      44     88     10      1     60
R7      27     90     32     73    100
R1      84     28     46     28     53
R8      50     26     57     12      6
R5      65     49     61     40     63
R4      22      1     61     91      4
R2      65     84     63     52     70
R6     100     90     70     39     40
R10     15     57     77     39     75
R9      28     93     93     10     28
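
Because the DataFrame above is random, its values change on every run. A hedged variant that fixes a seed (the value 42 is an arbitrary choice) makes the result reproducible and lets us check the ordering programmatically:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for this example
df_random = pd.DataFrame(rng.integers(1, 101, size=(10, 5)),
                         columns=["Col A", "Col B", "Col C", "Col D", "Col E"])

df_sorted = df_random.sort_values(by="Col C")  # ascending=True is the default
print(df_sorted["Col C"].is_monotonic_increasing)
```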


#### Pandas Exercise 07: Modify the name of the 5 columns of the above DataFrame to the following format: `N_column` where `N` is the column number (★★☆)

> NOTE: Review the function `pd.DataFrame.rename` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)

In [342]:
df.columns = [f"{i+1}_column" for i in range(df.shape[1])]
print("DataFrame with renamed Columns:\n\n", df)

DataFrame with renamed Columns:

      1_column  2_column  3_column  4_column  5_column
R1         84        28        46        28        53
R2         65        84        63        52        70
R3         44        88        10         1        60
R4         22         1        61        91         4
R5         65        49        61        40        63
R6        100        90        70        39        40
R7         27        90        32        73       100
R8         50        26        57        12         6
R9         28        93        93        10        28
R10        15        57        77        39        75


#### Pandas Exercise 08: Modify the index of the rows of the DataFrame of exercise 06 (★★☆)

> NOTE: Review the function `pd.DataFrame.set_index` (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html)

In [353]:
df.index = [f"Row_{i+1}" for i in range(df.shape[0])]
print("DataFrame with renamed Rows:\n\n", df)

DataFrame with renamed Rows:

         1_column  2_column  3_column  4_column  5_column
Row_1         84        28        46        28        53
Row_2         65        84        63        52        70
Row_3         44        88        10         1        60
Row_4         22         1        61        91         4
Row_5         65        49        61        40        63
Row_6        100        90        70        39        40
Row_7         27        90        32        73       100
Row_8         50        26        57        12         6
Row_9         28        93        93        10        28
Row_10        15        57        77        39        75
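
Exercises 07 and 08 assign new labels directly to `df.columns` and `df.index`. An alternative sketch uses `DataFrame.rename`, which accepts a mapping or a function and returns a new DataFrame; the small DataFrame here is invented to keep the example short.

```python
import pandas as pd

small = pd.DataFrame([[1, 2], [3, 4]], columns=["Col A", "Col B"], index=["R1", "R2"])

# rename takes a mapping (or a function) for the columns and for the index,
# and leaves the original DataFrame unchanged
renamed = small.rename(
    columns={"Col A": "1_column", "Col B": "2_column"},
    index=lambda label: label.replace("R", "Row_"),
)
print(renamed)
```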
