
# Day 11 - Introduction to Pandas: Series and DataFrames




## Why is Pandas Important?

Pandas is a cornerstone of data analysis in Python. It lets you clean, manipulate, and analyze structured data efficiently. Pandas loads data into memory, so it works best on data sets that fit in RAM, from a handful of rows up to several million; for data larger than memory, you typically read it in chunks or pair Pandas with other tools.



## Series and DataFrames

### What is a Series?
A Series is a one-dimensional, labeled array that can hold any data type, such as integers, floats, strings, and even Python objects. Each value is paired with an index label, which defaults to the integers 0, 1, 2, and so on. A Series is similar to a column in a spreadsheet or a database table.


In [None]:

import pandas as pd

# Creating a Series
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
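
As a minimal sketch of how labels change the way a Series is used, the cell below builds a Series with a custom index. The labels, values, and the name `temp_c` are made up for illustration and are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
temperatures = pd.Series([21.5, 19.0, 23.2], index=['Mon', 'Tue', 'Wed'], name='temp_c')

# Label-based access goes through the index; iloc accesses by position
tuesday = temperatures['Tue']
first = temperatures.iloc[0]

# Arithmetic is applied element-wise and the index is preserved
temp_f = temperatures * 9 / 5 + 32

print(temperatures)
print(tuesday, first)
print(temp_f)
```

Accessing by label rather than by position tends to make code easier to read once the index carries meaning, such as dates or IDs.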



### What is a DataFrame?
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. You can think of it as a spreadsheet or a SQL table, where data is aligned in rows and columns.


In [None]:

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
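
To make "two-dimensional" and "heterogeneous" concrete, the following sketch rebuilds the same small table under the hypothetical name `people` and inspects its shape and per-column types. The helper `pd.api.types.is_numeric_dtype` is used here only to check the column types programmatically.

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs on its own
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = people.shape
column_names = list(people.columns)

# Each column has its own dtype: Age is numeric, Name holds text
age_is_numeric = pd.api.types.is_numeric_dtype(people['Age'])
name_is_numeric = pd.api.types.is_numeric_dtype(people['Name'])

print(people.shape)
print(people.dtypes)
```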



## Creating and Manipulating DataFrames

Now that we've covered the basics, let's explore some essential operations you can perform on DataFrames.


In [None]:

# Selecting a column
names = df['Name']
print(names)
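
Single brackets return a Series. As a rough sketch of the other common selection forms, the cell below uses double brackets for several columns, `loc` for label-based access, and `iloc` for position-based access. The DataFrame name `people` is hypothetical and the data is rebuilt so the cell runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# A list of column names returns a DataFrame, not a Series
subset = people[['Name', 'City']]

# loc selects by row label and column name
bob_city = people.loc[1, 'City']

# iloc selects by integer position; a single row comes back as a Series
first_row = people.iloc[0]

print(subset)
print(bob_city)
print(first_row)
```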


In [None]:

# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
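
Filters can also be combined. The sketch below, using a hypothetical copy of the sample table named `people`, joins two conditions with `&` and uses `isin` to match against a list of values. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Combine boolean masks with & (and) or | (or), not the Python keywords
mask = (people['Age'] >= 30) & (people['City'] != 'Chicago')
result = people[mask]

# isin keeps rows whose value appears in the given list
in_cities = people[people['City'].isin(['New York', 'Chicago'])]

print(result)
print(in_cities)
```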


In [None]:

# Adding a new column
df['Salary'] = [70000, 80000, 90000]
print(df)
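
New columns are often computed from existing ones. As a small sketch, the cell below derives a column with vectorized arithmetic and then uses `assign`, which returns a new DataFrame instead of modifying the original. The names `people`, `Monthly`, and `Senior` are made up for this example.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 80000, 90000],
})

# Column assignment modifies the DataFrame in place
people['Monthly'] = people['Salary'] / 12

# assign returns a copy with the extra column; people itself is unchanged
with_flag = people.assign(Senior=people['Age'] > 30)

print(people)
print(with_flag)
```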



## Use Case: Loading and Exploring Movie Ratings

For this tutorial, we will use the small version of the MovieLens Latest Datasets, which contains roughly 100,000 ratings from about 600 users on about 9,000 movies.

The dataset is small enough to load and explore comfortably within a notebook, which makes it a good fit for demonstrating Pandas.


In [None]:

# pd.read_csv accepts a URL directly, so no separate HTTP client is needed
url_movies = 'https://raw.githubusercontent.com/ricardogr07/100-days-of-python-and-data-science/main/11%20-%20Pandas%20Series%20and%20Dataframes/ml-latest-small/movies.csv'
movies_df = pd.read_csv(url_movies)

# Display the first few rows of the DataFrame
print("First few rows of the movies dataset:")
print(movies_df.head())
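
If the network is unavailable, the same `read_csv` call can be tried offline. The sketch below reads a tiny hand-written CSV from an in-memory string; the three rows are invented placeholders that only mimic the `movieId`, `title`, `genres` layout and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical sample rows; a title containing a comma must be quoted
csv_text = '''movieId,title,genres
1,Movie A (2000),Comedy|Drama
2,"Example, The (2001)",Drama
3,Movie C (2002),Action|Comedy
'''

# read_csv accepts any file-like object, not just paths and URLs
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head(2))
print(sample_df.shape)
```

`head(n)` returns the first `n` rows, and `tail(n)` does the same from the end.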


In [None]:

# Get a summary of the DataFrame's structure and contents.
# info() prints its report itself and returns None, so it is not wrapped in print().
print("Summary information about the dataset:")
movies_df.info()

# Check for any missing values in the dataset
print("Missing values in the dataset:")
print(movies_df.isnull().sum())
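
To show what the missing-value check reports when gaps exist, here is a sketch on a small invented table; the column names and values are hypothetical. It also shows two common follow-ups, `dropna` and `fillna`, which are only two of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing year and one missing genre
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [1999, np.nan, 2005],
    'genres': ['Drama', None, 'Comedy'],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_per_column = sample.isnull().sum()

# dropna removes every row that has at least one missing value
complete_rows = sample.dropna()

# fillna with a dict replaces missing values only in the named columns
filled = sample.fillna({'genres': 'Unknown'})

print(missing_per_column)
print(complete_rows)
print(filled)
```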


In [None]:

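# Split the pipe-separated genres string into a list of genres per movie.
# assign() returns a new DataFrame, so movies_df is unchanged and this cell can be re-run safely.
movies_split = movies_df.assign(genres=movies_df['genres'].str.split('|'))

# Explode the list of genres into one row per (movie, genre) pair
movies_exploded = movies_split.explode('genres')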

# Count the occurrences of each genre
genre_counts = movies_exploded['genres'].value_counts()

print("Number of movies per genre:")
print(genre_counts)
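
The split, explode, and count steps are easier to follow on a table small enough to check by hand. The sketch below applies them to three invented movies; the titles and genre strings are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the same pipe-separated genre format
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Comedy|Drama', 'Drama', 'Action|Comedy|Drama'],
})

# Split each string into a list, then expand each list element into its own row
exploded = sample.assign(genres=sample['genres'].str.split('|')).explode('genres')

# value_counts sorts from most to least frequent
counts = exploded['genres'].value_counts()

print(exploded)
print(counts)
```

Three movies become six rows after exploding, because a movie with several genres is counted once per genre.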
