# Lab 05 - Pandas Library - Annette Bazan

##### Pandas Library:
1. Pandas is another powerful Python library for manipulating data.
2. It is built on top of NumPy.
3. While NumPy mainly handles numeric dtypes, Pandas handles both string and numeric values.
4. There are two main data structures in Pandas: Series and DataFrames.
   - A Series is a 1-D (one-dimensional) structure.
   - A DataFrame (df) is a 2-D (two-dimensional) structure: a tabular dataset with rows and columns.
5. The most reliable source for Pandas is the official site: https://pandas.pydata.org/
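
As a minimal sketch of points 3 and 4 (the names `names` and `people` are illustrative and not part of the lab), the snippet below builds a Series and a DataFrame that mix strings and numbers, then checks their dimensionality:

```python
import pandas as pd

# A Series is 1-D: one column of values plus an index
names = pd.Series(["Sam", "Mike", "Lilly"])

# A DataFrame is 2-D: rows and columns, and each column can hold a different dtype
people = pd.DataFrame({"Name": ["Sam", "Mike", "Lilly"], "Age": [45, 24, 78]})

print(names.ndim)    # 1
print(people.ndim)   # 2
print(people.shape)  # (3, 2)
```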

### Installing and Importing Pandas Library

In [1]:
# Let's install Pandas
!pip install pandas



In [27]:
# Let's import NumPy and pandas
import numpy as np
import pandas as pd # pd is the alias name for Pandas

In [3]:
# Let's check the version of Pandas that we are using here
print(pd.__version__)

2.2.2


### Data Structures in Pandas
1. Series: It is made up of two parts:
 - Index (label)
 - The values (1-D column)
2. DataFrames (df): It is made up of rows and columns
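
The two parts of a Series can be inspected separately. This is a small sketch with made-up labels (`a`, `b`, `c` are assumptions for illustration), showing the index on one side and the underlying values on the other:

```python
import pandas as pd

# Custom labels are passed through the index argument
grades = pd.Series([67, 45, 23], index=["a", "b", "c"], name="Grades")

print(list(grades.index))  # ['a', 'b', 'c']
print(grades.to_numpy())   # [67 45 23]
print(grades.name)         # Grades
```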

#### 1. Series

In [7]:
# 1. Let's create a Pandas Series using a Python List
s = pd.Series([67,45,23,100,105], name= 'Grades')
s

0     67
1     45
2     23
3    100
4    105
Name: Grades, dtype: int64

In [6]:
# Let's see the type of s
type(s)

pandas.core.series.Series

In [14]:
# 2. Let's create a Pandas Series using a Python dictionary
s1 = pd.Series({100:"swan", 2:"dove", 'Thanksgiving':"turkey", 4:"parrot", 'Bazan':"eagle"}, name= 'birds')
s1

100               swan
2                 dove
Thanksgiving    turkey
4               parrot
Bazan            eagle
Name: birds, dtype: object

In [15]:
# Let's see the type of s1
type(s1)

pandas.core.series.Series

In [16]:
# Let's see the index of s1
s1.index

Index([100, 2, 'Thanksgiving', 4, 'Bazan'], dtype='object')

In [17]:
# Let's check the dimensionality 
s1.ndim

1

### Indexing and Slicing Pandas Series

In [19]:
# Let's have eagle as the output; indexing a Series
s1['Bazan']

'eagle'

In [21]:
# Slicing a Series; display the first 3 elements of s
s[:3]

0    67
1    45
2    23
Name: Grades, dtype: int64
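
When a Series has integer labels, as the birds Series does, it can be unclear whether a number refers to a label or a position. One way to make the intent explicit, sketched here with the same data under the hypothetical name `birds`, is to use `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

birds = pd.Series(
    {100: "swan", 2: "dove", "Thanksgiving": "turkey", 4: "parrot", "Bazan": "eagle"},
    name="birds",
)

by_label = birds.loc[2]       # the element whose label is 2
by_position = birds.iloc[2]   # the third element (position 2)
first_three = birds.iloc[:3]  # positions 0, 1, and 2

print(by_label)     # dove
print(by_position)  # turkey
```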

#### 2. DataFrames (df)
1. A Pandas DataFrame is a 2-D structure with rows and columns.
2. DataFrames are the most commonly used data structure in Pandas.
3. There are two axes in a DataFrame: axis=0 (rows) and axis=1 (columns).
4. Python has no built-in 2-D tabular data structure; a list of dictionaries is one common way to represent the rows of a table.
5. Pandas DataFrames can be created from a collection of Python dictionaries.
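
The two axes are easiest to see with an aggregation. In this sketch (the `scores` data is invented for illustration), `axis=0` works down the rows and returns one value per column, while `axis=1` works across the columns and returns one value per row:

```python
import pandas as pd

scores = pd.DataFrame(
    {"Quiz": [8, 6, 9], "Exam": [80, 70, 90]},
    index=["Sam", "Mike", "Lilly"],
)

col_totals = scores.sum(axis=0)  # collapses the rows: one total per column
row_totals = scores.sum(axis=1)  # collapses the columns: one total per row

print(col_totals["Quiz"])  # 23
print(row_totals["Sam"])   # 88
```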

In [22]:
# 1. Creating df using Python dictionaries
d = [{100:"swan", 2:"dove", 'Thanksgiving':"turkey", 4:"parrot", 'Bazan':"eagle"},
{1:'a', 2:'b', 3:'c', 4:'d', 5:'e'}]
d


[{100: 'swan',
  2: 'dove',
  'Thanksgiving': 'turkey',
  4: 'parrot',
  'Bazan': 'eagle'},
 {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}]

In [23]:
# Let's check the type of d
type(d)

list

In [36]:
# Let's convert the list d to a DataFrame named df1
df1 = pd.DataFrame(d)
df1

    100     2  Thanksgiving       4  Bazan    1    3    5
0  swan  dove        turkey  parrot  eagle  NaN  NaN  NaN
1   NaN     b           NaN       d    NaN    a    c    e


In [37]:
# Let's check the type of df1
type(df1)

pandas.core.frame.DataFrame

**Observations**
- Each dictionary becomes a row in the DataFrame, and the keys from both dictionaries are combined to form the column headers.
- NaN marks a missing value. Keys 2 and 4 appear in both dictionaries, so those columns are filled in both rows; keys 1, 3, and 5 have no match in the bird dictionary, so row 0 is NaN in those columns.
- The dictionaries use a mix of integer and string keys.
- Pandas preserves these as column names, and the DataFrame handles the mixed key types without issue.
- The use of 2 and 4 as keys in both dictionaries might suggest an intent to align data.
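
The missing values described above can be counted rather than read off the table. This sketch rebuilds the same DataFrame and uses `isna()` and `dropna()`, which are not used elsewhere in the lab, to find the NaN count per column and the columns that are fully populated:

```python
import pandas as pd

d = [
    {100: "swan", 2: "dove", "Thanksgiving": "turkey", 4: "parrot", "Bazan": "eagle"},
    {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"},
]
df1 = pd.DataFrame(d)

# Number of NaN entries in each column
missing_per_column = df1.isna().sum()

# Columns that have a value in every row
complete_columns = df1.dropna(axis=1).columns.tolist()

print(complete_columns)  # [2, 4]
```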

In [29]:
# 2. Creating a df using NumPy arrays
array = np.array([[45,85000],[24,45000],[78,102000],[19,24000],[58,120000]]) 
array

array([[    45,  85000],
       [    24,  45000],
       [    78, 102000],
       [    19,  24000],
       [    58, 120000]])

In [38]:
# Let's convert array into a pandas df, changing names of columns and rows.
df_2 = pd.DataFrame(array, columns = ['Age', 'Income'], index=['Sam','Mike','Lilly','Sara','Bazan'])
df_2

       Age  Income
Sam     45   85000
Mike    24   45000
Lilly   78  102000
Sara    19   24000
Bazan   58  120000


**Observations**
* The DataFrame df_2 shows:
  * Sam, aged 45, earning $85,000
  * Mike, aged 24, earning $45,000
  * Lilly, aged 78, earning $102,000
  * Sara, aged 19, earning $24,000
  * Bazan, aged 58, earning $120,000 (even though I am not 58!!!)


### Data Attributes: Add the information that each attribute provides
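- Data attributes provide more information about the dataset. For df_2:
- shape = (5, 2): the number of rows and columns
- ndim = 2: the number of dimensions
- size = 10: the total number of elements (rows times columns)
- dtypes = Age int32, Income int32: the data type of each column (the integer width can differ by platform and NumPy version)
- columns = Index(['Age', 'Income'], dtype='object'): the column labels
- index = Index(['Sam', 'Mike', 'Lilly', 'Sara', 'Bazan'], dtype='object'): the row labels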
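
These attributes are related to one another. As a quick sketch that rebuilds df_2 from the same array, `size` equals the product of the two entries in `shape`, and `ndim` equals the length of `shape`:

```python
import numpy as np
import pandas as pd

array = np.array([[45, 85000], [24, 45000], [78, 102000], [19, 24000], [58, 120000]])
df_2 = pd.DataFrame(
    array,
    columns=["Age", "Income"],
    index=["Sam", "Mike", "Lilly", "Sara", "Bazan"],
)

n_rows, n_cols = df_2.shape
print(df_2.size == n_rows * n_cols)  # True
print(df_2.ndim == len(df_2.shape))  # True
```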

In [39]:
# Shape
df_2.shape

(5, 2)

In [40]:
# Ndim
df_2.ndim

2

In [41]:
# Size
df_2.size

10

In [42]:
# Dtypes
df_2.dtypes

Age       int32
Income    int32
dtype: object

In [43]:
# Columns
df_2.columns

Index(['Age', 'Income'], dtype='object')

In [44]:
# Index
df_2.index

Index(['Sam', 'Mike', 'Lilly', 'Sara', 'Bazan'], dtype='object')

In [46]:
# Let's load the movies dataset
df = pd.read_csv('movies.csv')

In [49]:
# Making sure the dataset has been loaded
df.head(8)

   Rank                         Title       Studio      Gross  Year
0     1             Avengers: Endgame  Buena Vista  $2,796.30  2019
1     2                        Avatar          Fox  $2,789.70  2009
2     3                       Titanic    Paramount  $2,187.50  1997
3     4  Star Wars: The Force Awakens  Buena Vista  $2,068.20  2015
4     5        Avengers: Infinity War  Buena Vista  $2,048.40  2018
5     6                Jurassic World    Universal  $1,671.70  2015
6     7         Marvel's The Avengers  Buena Vista  $1,518.80  2012
7     8                     Furious 7    Universal  $1,516.00  2015


In [48]:
# Let's look at the last 5 rows of the dataset
df.tail()

     Rank                     Title           Studio    Gross  Year
777   778                 Yogi Bear  Warner Brothers  $201.60  2010
778   779       Garfield: The Movie              Fox  $200.80  2004
779   780               Cats & Dogs  Warner Brothers  $200.70  2001
780   781  The Hunt for Red October        Paramount  $200.50  1990
781   782                  Valkyrie              MGM  $200.30  2008
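
The Gross values contain a dollar sign and a thousands separator, so they are likely stored as strings rather than numbers. The sketch below is one possible way to convert them, not a step from the lab; it uses three rows copied from the `df.head(8)` output instead of reading `movies.csv`, and the names `sample` and `gross_numeric` are illustrative:

```python
import pandas as pd

# Three rows copied from the head() output; not the full dataset
sample = pd.DataFrame({
    "Title": ["Avengers: Endgame", "Avatar", "Titanic"],
    "Gross": ["$2,796.30", "$2,789.70", "$2,187.50"],
})

# Strip the dollar sign and the comma, then convert the text to float
gross_numeric = (
    sample["Gross"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

print(gross_numeric.max())  # 2796.3
```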


#### Conclusion:
1. A Series is like a Python list: it is one-dimensional, and you can also assign a name to it.
2. Pandas by default assigns an index to every object. The index labels the rows; it is not part of the data itself.
3. s and s1 are both created with the same Pandas method, pd.Series; building one from a list and the other from a dictionary does not make them different types.
4. DataFrames (df) are two-dimensional: they have rows and columns.
5. A Pandas Series has two parts: the index and the values.
6. index is an attribute of a Series, just like ndim.
7. The default Pandas index starts at 0 (0-4 for the five elements of s), but when a Series is created from a dictionary, the keys become the index (1-5 or anything we want).
8. A Series is ordered, so it supports both indexing and slicing. Indexing uses a label, as in s1['Bazan']; slicing uses a range, as in s[:3].
9. A list of two dictionaries, [{}, {}], can be turned into a df; the column names are shown on top and the index on the left side of the table.
10. NaN stands for "Not a Number"; it marks a missing value.
11. The resulting DataFrame is sparse due to the differing keys between the two dictionaries. Only two columns (2 and 4) have values in both rows, while the other six columns (100, 'Thanksgiving', 'Bazan', 1, 3, 5) have NaN in one row. This sparsity could affect downstream analysis if not intentional.
12. Always remember to import numpy and pandas when using either or both.
13. For the final, remember how to name the columns and name the rows.
14. dtypes gives the data type of each column in the dataset. The object dtype is neither integer nor float; it can hold strings or a mixture of strings and numbers.
15. Pandas DataFrames are incredibly versatile, making it easy to manipulate and analyze structured data with intuitive, table-like operations.

#### End of Lab 05