# Pandas Library 

Pandas is a powerful open-source data manipulation and analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets, along with tools for working with structured data. The primary data structures in Pandas are Series and DataFrame.

Core data structures:

Series:

A one-dimensional labeled array capable of holding any data type.
Can be thought of as a column in a spreadsheet or a simple dataset.
Similar to a NumPy array but with an associated labeled index.

DataFrame:

A two-dimensional table with rows and columns, similar to a spreadsheet or SQL table.
Each column in a DataFrame is a Series.

-------------------------------------------------------------------------------------------

**Comparison with NumPy:**

Data Structures:

NumPy primarily provides support for multi-dimensional arrays (ndarrays).
Pandas introduces higher-level data structures like Series and DataFrame, built on top of NumPy arrays.

Indexing:

NumPy arrays are implicitly indexed.
Pandas provides explicit indexes with labels, making it more suitable for labeled data.

Use Case:

NumPy is more focused on numerical computing and mathematical operations on arrays.
Pandas is designed for data manipulation and analysis, making it suitable for handling real-world datasets.
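
The indexing difference can be sketched in a few lines. This is only an illustration with made-up values; the names arr and ser are assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# NumPy: elements are addressed by integer position only
arr = np.array([10, 20, 30])
second_by_position = arr[1]

# pandas: the same values with explicit labels attached
ser = pd.Series([10, 20, 30], index=["x", "y", "z"])
second_by_label = ser["y"]

# to_numpy() hands the values back as a plain NumPy array
underlying = ser.to_numpy()

print(second_by_position, second_by_label, type(underlying))
```

Both lookups reach the same element; the Series simply adds a label on top of the position.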



In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/anime-tv-shows-dataset-2023/anime_data.csv


# Pandas Series Basics:

A Pandas Series is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a single column in a DataFrame. Each element in a Series has a label, which is referred to as its index.

In [2]:
# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series_1 = pd.Series(data)
print(series_1)   # prints a single column of values, each marked with its index value

0    10
1    20
2    30
3    40
4    50
dtype: int64


In [3]:
# Creating a Series with custom index
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series_2 = pd.Series(data, index=index)
print(series_2)

a    10
b    20
c    30
d    40
e    50
dtype: int64


In [4]:
# Accessing elements by index
print(series_2['c'])       # label-based access through the custom index -> 30
print(series_2.iloc[0])    # position-based access to the first element -> 10

30
10
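
A hedged sketch of the two explicit accessors, .loc for labels and .iloc for positions. Using a bare integer key on a Series with string labels relies on a positional fallback that newer pandas releases deprecate, so spelling out the accessor is the safer habit. The Series below recreates series_2 so the block runs on its own.

```python
import pandas as pd

series_2 = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = series_2.loc["c"]          # label-based lookup
by_position = series_2.iloc[0]        # position-based lookup, independent of the labels
label_slice = series_2.loc["b":"d"]   # label slices include both endpoints
pos_slice = series_2.iloc[1:3]        # positional slices exclude the stop position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```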


In [5]:
# Performing operations on Series
series_3 = pd.Series([1, 2, 3, 4])
series_4 = pd.Series([5, 6, 7, 8])     # both Series use the default index 0..3, so values are added label by label
series_sum = series_4 + series_3
print(series_sum)

0     6
1     8
2    10
3    12
dtype: int64


In [6]:
# Adding Series whose indexes do not match
series_3 = pd.Series([1, 2, 3, 4])     # series_2 is labeled a..e, series_3 is labeled 0..3; no label is shared, so every result is NaN (Not a Number)
series_sum = series_2 + series_3
print(series_sum)

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
0   NaN
1   NaN
2   NaN
3   NaN
dtype: float64


**The fillna() function in Pandas is used to fill missing (NaN) values with a specified value or a set of values. It's a versatile method that allows you to replace NaN values in a Series or DataFrame with a constant value, a calculated value, or values from another Series or DataFrame.**

In [7]:
# After the addition, replace NaN with 0 and then convert the resulting Series to int
series_sum = (series_4 + series_3).fillna(0).astype(int)
print(series_sum)

0     6
1     8
2    10
3    12
dtype: int64


In [8]:
series_3 = pd.Series([1, 2, 3, 4])     # no labels match series_2, so every sum is NaN and fillna(0) turns all of them into 0
series_sum = (series_2 + series_3).fillna(0).astype(int)
print(series_sum)

a    0
b    0
c    0
d    0
e    0
0    0
1    0
2    0
3    0
dtype: int64
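
Filling after the addition, as above, discards the original values wherever the labels do not line up. One alternative is Series.add with fill_value, which treats a missing side as the fill value before adding. The sketch below uses two small made-up Series (left and right are assumed names) with partly overlapping labels.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain_sum = left + right                    # NaN wherever a label is missing on one side
filled_sum = left.add(right, fill_value=0)  # the missing side counts as 0 before adding

print(plain_sum)
print(filled_sum)
```

Labels present in both Series are summed in either case; only the unmatched labels differ.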


In [9]:
vec = ["Car","Boat" ,"Bike"]  # Python List to series 
index = ["vec1","vec2","vec3"]
ser = pd.Series(vec , index = index)
print(ser)

vec1     Car
vec2    Boat
vec3    Bike
dtype: object


# DataFrame in Pandas:

A DataFrame is a two-dimensional labeled data structure with columns that can be of different data types. It is similar to a spreadsheet or SQL table, where data is arranged in rows and columns. In a DataFrame, each column is a Pandas Series, and the columns share a common index.

**Creating a DataFrame:**

You can create a DataFrame using various methods, such as from a dictionary, from a NumPy array, or by reading data from external sources like CSV files.

****Syntax****

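> pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

data: The data for the DataFrame. This can be a dictionary, a 2D array (list of lists), a Series, or another DataFrame.

index: The row labels (index) for the DataFrame. If not specified, default integer indices (0, 1, 2, ...) are used.

columns: The column labels for the DataFrame. If not specified, the dictionary keys are used when data is a dictionary; otherwise default integer labels are used.

dtype: Data type to force. If specified, the entire DataFrame is cast to that data type; if omitted, a type is inferred for each column.

copy: Whether to copy data from the inputs. In recent pandas versions the default is None, which copies dictionary input but not a 2D NumPy array or another DataFrame; older versions used False.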
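
A small sketch of how these parameters fit together, built from a 2D list. The labels and the name df_params are made up for illustration.

```python
import pandas as pd

rows = [[1, 2.5], [3, 4.5]]          # 2D list: one inner list per row

df_params = pd.DataFrame(
    data=rows,
    index=["r1", "r2"],               # row labels
    columns=["count", "weight"],      # column labels
    dtype=float,                      # cast every column to float
)

print(df_params)
print(df_params.dtypes)
```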



In [10]:
# Creating data from Dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)
print(df)

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles


****Specifying Custom Index and Columns****

In [11]:

df_custom = pd.DataFrame(data, index=['id1', 'id2','id3'], columns=['Name', 'Age'])
print(df_custom)

        Name  Age
id1    Alice   25
id2      Bob   30
id3  Charlie   35


In [12]:
# Accessing a single column
print(df['Name'])

print()
# Accessing multiple columns
print(df[['Name', 'City']])

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

      Name           City
0    Alice       New York
1      Bob  San Francisco
2  Charlie    Los Angeles


In [13]:
# Add a new column
df['Salary'] = [50000, 60000, 70000]
print(df)


      Name  Age           City  Salary
0    Alice   25       New York   50000
1      Bob   30  San Francisco   60000
2  Charlie   35    Los Angeles   70000


In [14]:
# Filtering based on a condition
filtered_df = df[df['Age'] > 30]
print(filtered_df)

      Name  Age         City  Salary
2  Charlie   35  Los Angeles   70000


In [15]:
myData = [('Boat',1),('car',2),('bike',3)]
myData_Frame = pd.DataFrame(myData , columns = ["veichle","count"])
print(myData_Frame)

  veichle  count
0    Boat      1
1     car      2
2    bike      3


In [16]:
# Check the data type of every column
myData_Frame.dtypes
# Note: pandas reports columns of strings with the dtype "object"

veichle    object
count       int64
dtype: object

# Importing Datasets as DataFrames in Pandas (CSV)

CSV stands for Comma-Separated Values. It is a plain text file format that uses a simple structure to store tabular data (data in rows and columns). In a CSV file, each line represents a row of data, and the values within each row are separated by commas. The first row often contains headers, which describe the contents of each column.

For example:

Name, Age, City
John, 25, New York
Alice, 30, San Francisco
Bob, 28, Chicago


While Excel and other spreadsheet programs are valuable tools for data analysis and visualization, CSV is preferred in data science workflows due to its simplicity, compatibility, and ease of integration with programming languages and tools commonly used in the field. Additionally, CSV files are more suited for automated processing and scripting, which is a common requirement in data science pipelines.
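
To see the format in action without a file on disk, the sample above can be parsed from a string. This is a sketch: io.StringIO stands in for a real file, and skipinitialspace=True is used because the sample has a space after each comma. The name people is an assumption.

```python
import io
import pandas as pd

csv_text = """Name, Age, City
John, 25, New York
Alice, 30, San Francisco
Bob, 28, Chicago
"""

# skipinitialspace=True drops the blank that follows each comma in this sample
people = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(people)
print(people.dtypes)
```

The header row becomes the column names, and the Age column is parsed as integers.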

**For Jupyter users**

import pandas as pd

# Replace 'your_file.csv' with the actual file path or URL of your CSV file
# (a file path looks like C:\user\abc\document\file_name)
file_path = 'your_file.csv'

# Use the read_csv() function to import the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame to inspect the data
df.head()

Good to go!

# Importing a Dataset
1. Go to File.
2. Choose Add Data.
3. Search for any CSV dataset of your choice.
4. You will see a + icon next to each dataset.
5. Click + to add the dataset.
6. Check the data directory and file name.
7. Copy the path and paste it into pd.read_csv().

In [17]:
df = pd.read_csv('../input/anime-tv-shows-dataset-2023/anime_data.csv')

In [18]:
df.info()  # summarise the dataset 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4700 entries, 0 to 4699
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  4700 non-null   int64  
 1   Name        4700 non-null   object 
 2   Episodes    4667 non-null   float64
 3   Release     4700 non-null   object 
 4   Members     4700 non-null   int64  
 5   Score       4692 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 220.4+ KB


head() and tail():

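df.head(): Returns the first 5 rows of the DataFrame by default; pass a number, e.g. df.head(10), to change that.
df.tail(): Returns the last 5 rows of the DataFrame by default; it accepts the same argument.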

In [19]:
df.head() # First five

   Unnamed: 0                                Name  Episodes              Release  Members  Score
0           0                   Sousou no Frieren      28.0           Sep 2023 -   493571   9.13
1           1    Fullmetal Alchemist: Brotherhood      64.0  Apr 2009 - Jul 2010  3292928   9.09
2           2                         Steins;Gate      24.0  Apr 2011 - Sep 2011  2526417   9.07
3           3                            Gintama°      51.0  Apr 2015 - Mar 2016   620676   9.06
4           4  Shingeki no Kyojin Season 3 Part 2      10.0  Apr 2019 - Jul 2019  2227792   9.05


In [20]:
df.tail() # last five 

      Unnamed: 0                                  Name  Episodes              Release  Members  Score
4695        4695           Shounen Santa no Daibouken!      24.0  Apr 1996 - Sep 1996      482    NaN
4696        4696               Shounen Tokugawa Ieyasu      20.0  Apr 1975 - Sep 1975      636    NaN
4697        4697                    Shouwang Zhengfeng      26.0  Nov 2016 - Dec 2016       88    NaN
4698        4698  Shouwang Zhengfeng Shen Jiang Zhi Nu      26.0                    -       47    NaN
4699        4699    Shouwang Zhengfeng: Yuanshi Zhi Li      26.0               2017 -       59    NaN
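
Both methods take an optional row count. A minimal sketch on a made-up ten-row frame (numbers is an assumed name):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # first 3 rows instead of the default 5
last_two = numbers.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```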


describe():

df.describe(): Generates descriptive statistics of the numerical columns, such as mean, standard deviation, minimum, maximum, etc.

In [21]:
df.describe()  # provides summary statistics for the numerical columns

        Unnamed: 0   Episodes    Members     Score
count       4700.0     4667.0     4700.0    4692.0
mean        2349.5  31.027641   157109.4  6.854723
std    1356.917462  84.787108   332358.7  0.811294
min            0.0        2.0       45.0       2.9
25%        1174.75       12.0    4983.25      6.32
50%         2349.5       13.0    32819.5      6.87
75%        3524.25       26.0   153050.8      7.39
max         4699.0     3057.0  3884680.0      9.13
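
The result of describe() is itself a DataFrame, so individual statistics can be read out with .loc. A sketch on made-up data (values and summary are assumed names); note that the text column is left out by default.

```python
import pandas as pd

values = pd.DataFrame({"Score": [6.0, 7.0, 8.0, 9.0], "Name": list("abcd")})

summary = values.describe()                        # numeric columns only by default
mean_from_summary = summary.loc["mean", "Score"]   # pick one statistic by row and column label

print(summary)
print(mean_from_summary)
```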


shape:

df.shape: Returns a tuple representing the dimensions (number of rows, number of columns) of the DataFrame.

In [22]:
df.dtypes   # data type of each column

Unnamed: 0      int64
Name           object
Episodes      float64
Release        object
Members         int64
Score         float64
dtype: object

In [23]:
df.shape

(4700, 6)

columns:

df.columns: Returns an Index object holding the column names of the DataFrame.

The "Unnamed: 0" column in this DataFrame most likely comes from a CSV file that was saved together with its row index. The index is written as a first column with an empty header, and read_csv names such a column "Unnamed: 0" when it loads the file.

In [24]:
df.columns

Index(['Unnamed: 0', 'Name', 'Episodes', 'Release', 'Members', 'Score'], dtype='object')
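
How an "Unnamed: 0" column can arise, and one way to avoid it, can be reproduced with an in-memory round trip. This is a sketch on made-up data, not the anime dataset; io.StringIO stands in for a file, and the variable names are assumptions.

```python
import io
import pandas as pd

original = pd.DataFrame({"Name": ["A", "B"], "Score": [9.1, 8.7]})

buffer = io.StringIO()
original.to_csv(buffer)             # the index is written as a first column with an empty header

buffer.seek(0)
with_unnamed = pd.read_csv(buffer)  # the header-less column comes back as "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index instead

print(list(with_unnamed.columns))
print(list(restored.columns))
```

Writing the file with to_csv(..., index=False) is another option when the index carries no information.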

In [25]:
df['Episodes'].head()    # fetching a particular column by its column name

0    28.0
1    64.0
2    24.0
3    51.0
4    10.0
Name: Episodes, dtype: float64

In [26]:
df['Score']>8    # returns a boolean Series: True where Score is greater than 8

0        True
1        True
2        True
3        True
4        True
        ...  
4695    False
4696    False
4697    False
4698    False
4699    False
Name: Score, Length: 4700, dtype: bool

In [27]:
df['Score'].value_counts()  # counts how often each distinct value occurs

Score
7.28    35
7.15    34
7.22    34
7.16    33
6.53    33
        ..
4.97     1
4.95     1
4.91     1
4.89     1
2.90     1
Name: count, Length: 431, dtype: int64

unique():

df['column_name'].unique(): Returns an array of unique values in a specific column.

In [28]:
df['Name'].unique()

array(['Sousou no Frieren', 'Fullmetal Alchemist: Brotherhood',
       'Steins;Gate', ..., 'Shouwang Zhengfeng',
       'Shouwang Zhengfeng Shen Jiang Zhi Nu',
       'Shouwang Zhengfeng: Yuanshi Zhi Li'], dtype=object)
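
unique() and value_counts() are often used together with nunique(). A short sketch on a made-up Series (scores is an assumed name) shows what each one returns:

```python
import pandas as pd

scores = pd.Series([7.5, 8.0, 7.5, 9.0, 8.0, 7.5])

distinct = scores.unique()        # distinct values in order of first appearance
n_distinct = scores.nunique()     # how many distinct values there are
counts = scores.value_counts()    # occurrences per value, most frequent first

print(distinct)
print(n_distinct)
print(counts)
```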

In [29]:
df['Score'].mean()    # mean value of a column (NaN values are skipped)

6.854722932651321

In [30]:
df.sort_values(by='Score', ascending = False).head()   # highest scored top five anime

   Unnamed: 0                                Name  Episodes              Release  Members  Score
0           0                   Sousou no Frieren      28.0           Sep 2023 -   493571   9.13
1           1    Fullmetal Alchemist: Brotherhood      64.0  Apr 2009 - Jul 2010  3292928   9.09
2           2                         Steins;Gate      24.0  Apr 2011 - Sep 2011  2526417   9.07
3           3                            Gintama°      51.0  Apr 2015 - Mar 2016   620676   9.06
4           4  Shingeki no Kyojin Season 3 Part 2      10.0  Apr 2019 - Jul 2019  2227792   9.05


**Apart from these functions, pandas offers many others.**

In [31]:
df.loc[4]   # returns the row whose index label is 4

Unnamed: 0                                     4
Name          Shingeki no Kyojin Season 3 Part 2
Episodes                                    10.0
Release                      Apr 2019 - Jul 2019
Members                                  2227792
Score                                       9.05
Name: 4, dtype: object

The set_index method in pandas is used to set a specific column as the DataFrame's index. The index is a crucial component of a DataFrame, providing labels for the rows. By default, DataFrames have a numerical index starting from 0, but sometimes it's more meaningful to use one of the existing columns as the index, especially when the data has a natural key.

df.set_index(keys, drop=True, inplace=False)

keys: The column or columns you want to set as the index. This can be a single column name or a list of column names if you want a multi-level index.

drop: If True, it removes the column(s) used as the new index from the DataFrame. If False, it keeps the column(s) in the DataFrame as a regular column(s).

inplace: If True, the change is made in place, and it doesn't return a new DataFrame. If False (the default), it returns a new DataFrame with the updated index.

If you want to reset the index to the default numerical index, you can use the reset_index method:

df.reset_index(inplace=True)

In [32]:
# We can use the set_index method to make this column the index
new_df = df.set_index('Unnamed: 0')

new_df.head()

                                          Name  Episodes              Release  Members  Score
Unnamed: 0
0                            Sousou no Frieren      28.0           Sep 2023 -   493571   9.13
1             Fullmetal Alchemist: Brotherhood      64.0  Apr 2009 - Jul 2010  3292928   9.09
2                                  Steins;Gate      24.0  Apr 2011 - Sep 2011  2526417   9.07
3                                     Gintama°      51.0  Apr 2015 - Mar 2016   620676   9.06
4           Shingeki no Kyojin Season 3 Part 2      10.0  Apr 2019 - Jul 2019  2227792   9.05
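
The round trip between a column and the index can be sketched on a small made-up frame. The names people, indexed, and restored are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame({"id": ["p1", "p2", "p3"], "Age": [25, 30, 35]})

indexed = people.set_index("id")      # returns a new DataFrame; people itself is unchanged
age_p2 = indexed.loc["p2", "Age"]     # rows can now be looked up by id

restored = indexed.reset_index()      # "id" becomes a regular column again

print(indexed)
print(age_p2)
print(restored)
```

Leaving inplace at its default and assigning the result keeps the original frame available for comparison.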


# Subsetting the Data :

In [33]:
new_df = new_df[['Name','Episodes','Score']]   # keep only the columns we need
new_df

                                            Name  Episodes  Score
Unnamed: 0
0                              Sousou no Frieren      28.0   9.13
1               Fullmetal Alchemist: Brotherhood      64.0   9.09
2                                    Steins;Gate      24.0   9.07
3                                       Gintama°      51.0   9.06
4             Shingeki no Kyojin Season 3 Part 2      10.0   9.05
...                                          ...       ...    ...
4695                 Shounen Santa no Daibouken!      24.0    NaN
4696                     Shounen Tokugawa Ieyasu      20.0    NaN
4697                          Shouwang Zhengfeng      26.0    NaN
4698        Shouwang Zhengfeng Shen Jiang Zhi Nu      26.0    NaN


In [34]:
new_df['Score']>9   # boolean Series: True for rows whose Score exceeds 9

Unnamed: 0
0        True
1        True
2        True
3        True
4        True
        ...  
4695    False
4696    False
4697    False
4698    False
4699    False
Name: Score, Length: 4700, dtype: bool

In [35]:
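# Filter with the DataFrame.query method
new_df.query('Score > 9')  # returns the matching rows themselves rather than a boolean Series
# Note: several conditions can be combined with "and", e.g. on the earlier people DataFrame:
# filtered_df = df.query('Age > 28 and City == "San Francisco"')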

                                                  Name  Episodes  Score
Unnamed: 0
0                                    Sousou no Frieren      28.0   9.13
1                     Fullmetal Alchemist: Brotherhood      64.0   9.09
2                                          Steins;Gate      24.0   9.07
3                                             Gintama°      51.0   9.06
4                   Shingeki no Kyojin Season 3 Part 2      10.0   9.05
5                               Hunter x Hunter (2011)     148.0   9.04
6                            Bleach: Sennen Kessen-hen      13.0   9.03
7                                             Gintama'      51.0   9.03
8                                  Gintama': Enchousen      13.0   9.02
9           Kaguya-sama wa Kokurasetai: Ultra Romantic      13.0   9.02
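
A boolean mask and a query string are two ways to express the same filter. The sketch below compares them on a made-up frame (shows is an assumed name) with two conditions.

```python
import pandas as pd

shows = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Episodes": [12, 64, 24, 13],
    "Score": [8.5, 9.1, 9.0, 7.2],
})

# Boolean mask: combine conditions with & or | and wrap each one in parentheses
by_mask = shows[(shows["Score"] > 8) & (shows["Episodes"] < 30)]

# Equivalent query string
by_query = shows.query("Score > 8 and Episodes < 30")

print(by_mask)
print(by_mask.equals(by_query))
```

The mask form works with any Python expression, while the query form tends to read better when several conditions are chained.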


In [36]:
# Check for NaN values in the entire DataFrame
nan_check = df.isna()

# Display the DataFrame and NaN check result
print("Original DataFrame:")
print(df)
print("\nNaN Check:")
print(nan_check)

Original DataFrame:
      Unnamed: 0                                  Name  Episodes  \
0              0                     Sousou no Frieren      28.0   
1              1      Fullmetal Alchemist: Brotherhood      64.0   
2              2                           Steins;Gate      24.0   
3              3                              Gintama°      51.0   
4              4    Shingeki no Kyojin Season 3 Part 2      10.0   
...          ...                                   ...       ...   
4695        4695           Shounen Santa no Daibouken!      24.0   
4696        4696               Shounen Tokugawa Ieyasu      20.0   
4697        4697                    Shouwang Zhengfeng      26.0   
4698        4698  Shouwang Zhengfeng Shen Jiang Zhi Nu      26.0   
4699        4699    Shouwang Zhengfeng: Yuanshi Zhi Li      26.0   

                  Release  Members  Score  
0              Sep 2023 -   493571   9.13  
1     Apr 2009 - Jul 2010  3292928   9.09  
2     Apr 2011 - Sep 2011  2526

In [37]:
# Drop rows with NaN values
df_no_nan = df.dropna()

# Display the DataFrame after dropping NaN values
print("DataFrame after dropping NaN values:")
print(df_no_nan)

DataFrame after dropping NaN values:
      Unnamed: 0                                Name  Episodes  \
0              0                   Sousou no Frieren      28.0   
1              1    Fullmetal Alchemist: Brotherhood      64.0   
2              2                         Steins;Gate      24.0   
3              3                            Gintama°      51.0   
4              4  Shingeki no Kyojin Season 3 Part 2      10.0   
...          ...                                 ...       ...   
4687        4687                              Hanoka      12.0   
4688        4688                                Pupa      12.0   
4689        4689                      Vampire Holmes      12.0   
4690        4690                             Ladyspo      12.0   
4691        4691                              Ex-Arm      12.0   

                  Release  Members  Score  
0              Sep 2023 -   493571   9.13  
1     Apr 2009 - Jul 2010  3292928   9.09  
2     Apr 2011 - Sep 2011  2526417   9

In [38]:
# Fill NaN values with a specific value (e.g., 0)
df_filled = df.fillna(0)

# Display the DataFrame after filling NaN values
print("DataFrame after filling NaN values:")
print(df_filled)

DataFrame after filling NaN values:
      Unnamed: 0                                  Name  Episodes  \
0              0                     Sousou no Frieren      28.0   
1              1      Fullmetal Alchemist: Brotherhood      64.0   
2              2                           Steins;Gate      24.0   
3              3                              Gintama°      51.0   
4              4    Shingeki no Kyojin Season 3 Part 2      10.0   
...          ...                                   ...       ...   
4695        4695           Shounen Santa no Daibouken!      24.0   
4696        4696               Shounen Tokugawa Ieyasu      20.0   
4697        4697                    Shouwang Zhengfeng      26.0   
4698        4698  Shouwang Zhengfeng Shen Jiang Zhi Nu      26.0   
4699        4699    Shouwang Zhengfeng: Yuanshi Zhi Li      26.0   

                  Release  Members  Score  
0              Sep 2023 -   493571   9.13  
1     Apr 2009 - Jul 2010  3292928   9.09  
2     Apr 2011 

In [39]:
# Replace NaN values in column 'Score' with the mean of that column
mean_score = df['Score'].mean()
score_filled = df['Score'].fillna(mean_score)   # returns a new Series; df itself is unchanged

# Display the Score column after replacing NaN values
print("Score column after replacing NaN values with the mean:")
print(score_filled)

Score column after replacing NaN values with the mean:
0       9.130000
1       9.090000
2       9.070000
3       9.060000
4       9.050000
          ...   
4695    6.854723
4696    6.854723
4697    6.854723
4698    6.854723
4699    6.854723
Name: Score, Length: 4700, dtype: float64
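
Because fillna returns a new object, the result has to be assigned back if the DataFrame itself should change. A sketch on made-up data (ratings is an assumed name), which also shows the dictionary form that fills each column with its own value:

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    "Episodes": [12.0, np.nan, 24.0],
    "Score": [8.0, 6.0, np.nan],
})

# Assign the filled column back so the DataFrame keeps the change
ratings["Score"] = ratings["Score"].fillna(ratings["Score"].mean())

# A dict maps column names to the value used for that column
ratings = ratings.fillna({"Episodes": 0})

print(ratings)
```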


**Add Column**

In [40]:
new_df['new_col'] = new_df['Episodes'] / new_df['Score']
new_df.head()

                                          Name  Episodes  Score   new_col
Unnamed: 0
0                            Sousou no Frieren      28.0   9.13  3.066813
1             Fullmetal Alchemist: Brotherhood      64.0   9.09  7.040704
2                                  Steins;Gate      24.0   9.07  2.646086
3                                     Gintama°      51.0   9.06  5.629139
4           Shingeki no Kyojin Season 3 Part 2      10.0   9.05  1.104972


**Add Row**

In [41]:
# Concatenate two DataFrames to append rows

df_to_add = new_df.tail(1)  # append a copy of the last row

# pd.concat returns a new DataFrame, so keep the result in a variable
extended_df = pd.concat([new_df, df_to_add])
extended_df.tail()

                                            Name  Episodes  Score  new_col
Unnamed: 0
4696                     Shounen Tokugawa Ieyasu      20.0    NaN      NaN
4697                          Shouwang Zhengfeng      26.0    NaN      NaN
4698        Shouwang Zhengfeng Shen Jiang Zhi Nu      26.0    NaN      NaN
4699          Shouwang Zhengfeng: Yuanshi Zhi Li      26.0    NaN      NaN
4699          Shouwang Zhengfeng: Yuanshi Zhi Li      26.0    NaN      NaN
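
Concatenation keeps the original index labels, so appended rows can end up with duplicate labels. Passing ignore_index=True renumbers the rows instead. A sketch on made-up frames (base and extra are assumed names):

```python
import pandas as pd

base = pd.DataFrame({"Name": ["A", "B"], "Score": [9.0, 8.5]})
extra = pd.DataFrame({"Name": ["C"], "Score": [7.0]})

kept_labels = pd.concat([base, extra])                      # index becomes 0, 1, 0
renumbered = pd.concat([base, extra], ignore_index=True)    # index becomes 0, 1, 2

print(kept_labels)
print(renumbered)
```

In both cases base is left untouched; pd.concat always builds a new DataFrame.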
