**PANDAS LIBRARY IN PYTHON**
# What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
# Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.
# What Can Pandas Do?
Pandas gives you answers about the data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
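
Each of these questions maps onto a DataFrame or Series method. A minimal sketch, assuming a small made-up table (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "duration": [30, 45, 60, 90],
    "calories": [200, 300, 400, 600],
})

# Correlation between two columns (Pearson by default)
correlation = df["duration"].corr(df["calories"])

# Average, maximum, and minimum of one column
average = df["calories"].mean()
maximum = df["calories"].max()
minimum = df["calories"].min()

print(correlation, average, maximum, minimum)
```

Calling `df.corr()` on the whole DataFrame returns the pairwise correlations of all numeric columns at once.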

**PANDAS SERIES**
# What is a Series?
A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [21]:
import pandas as pd

a = [1,2,3,4,5,6,7,8,9,10]

myvar = pd.Series(a)

print(myvar)

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
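
By default the values are labeled with their position (0, 1, 2, and so on). A Series can also carry custom labels through the `index` argument. A short sketch with illustrative values:

```python
import pandas as pd

# Values and labels are made up for illustration
calories = pd.Series([400, 300, 200], index=["day1", "day2", "day3"])

# Access a value by its label rather than by its position
print(calories["day2"])

# Positional access is still available through iloc
print(calories.iloc[0])
```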


# What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.


In [22]:
#data frame in pandas
import pandas as pd

data = {
  "calories": [400, 300, 200],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

   calories  duration
0       400        50
1       300        40
2       200        45


In [23]:
#Named Indexes
import pandas as pd

data = {
  "calories": [400, 300, 200],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

      calories  duration
day1       400        50
day2       300        40
day3       200        45
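
With named indexes, rows can be looked up by label using `loc`. A short sketch reusing the same data as the cell above:

```python
import pandas as pd

data = {"calories": [400, 300, 200], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

# A single label returns that row as a Series
day2 = df.loc["day2"]

# A list of labels returns a DataFrame with those rows
first_two = df.loc[["day1", "day2"]]

print(day2)
print(first_two)
```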


# Read CSV Files
A simple way to store big data sets is to use CSV files (comma-separated values files).

CSV files contain plain text and are a well-known format that can be read by almost any tool, including Pandas.

In our examples we will be using a CSV file called 'data.csv'.

In [None]:
print('Name,City,Age,Gender,Occupation')
print('Alice,New York,25,Female,Engineer')
print('Bob,London,30,Male,Doctor')
print('Charlie,Paris,22,Male,Artist')
print('David,Tokyo,28,Male,Teacher')

Name,City,Age,Gender,Occupation
Alice,New York,25,Female,Engineer
Bob,London,30,Male,Doctor
Charlie,Paris,22,Male,Artist
David,Tokyo,28,Male,Teacher
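
To show how text in this format becomes a DataFrame without depending on a file on disk, the sketch below feeds the same rows to `pd.read_csv` through `io.StringIO`. Using an in-memory buffer is an assumption made for the example; passing a file path works the same way.

```python
import io
import pandas as pd

csv_text = """Name,City,Age,Gender,Occupation
Alice,New York,25,Female,Engineer
Bob,London,30,Male,Doctor
Charlie,Paris,22,Male,Artist
David,Tokyo,28,Male,Teacher
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.dtypes)   # column types inferred by Pandas
```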


**PANDAS DATAFRAME**
 * While NumPy is great for numerical operations, Pandas is better suited to selecting, filtering, and modifying tabular data. The following example performs those operations using Pandas:

In [None]:
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Selecting data
names = df['Name']  # Select the 'Name' column
alice_data = df[df['Name'] == 'Alice']  # Select row where Name is 'Alice'

# Filtering rows
age_above_25 = df[df['Age'] > 25]  # Select rows where Age is greater than 25

# Modifying data
df['Age'] = df['Age'] + 1  # Increase everyone's age by 1
df.loc[df['Name'] == 'Alice', 'City'] = 'Seattle'  # Change Alice's city to Seattle

# Print the modified DataFrame
print(df)

      Name  Age     City
0    Alice   26  Seattle
1      Bob   31   London
2  Charlie   23    Paris
3    David   29    Tokyo


In [None]:
import csv


In [7]:
%%writefile mydata.csv
Age,Name,City,Gender,Occupation,Salary
25,meggai,New York,Female,Engineer,60000
30,Bobby,London,Male,Doctor,100000
22,Charluhasan,Paris,Male,Artist,40000
28,Deva,Tokyo,Male,Teacher,55000

Overwriting mydata.csv


In [8]:
import pandas as pd
df = pd.read_csv('mydata.csv')
print(df)

   Age         Name      City  Gender Occupation  Salary
0   25       meggai  New York  Female   Engineer   60000
1   30        Bobby    London    Male     Doctor  100000
2   22  Charluhasan     Paris    Male     Artist   40000
3   28         Deva     Tokyo    Male    Teacher   55000


In [10]:
import pandas as pd

# 1. Read data from the CSV file
df = pd.read_csv('mydata.csv')

print("Original DataFrame:")
print(df)

# 2. Handle missing values
# Fill missing values in 'Age' with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in 'Salary' with the mean value
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print("\nDataFrame after handling missing values:")
print(df)

# 3. Remove duplicate rows
df = df.drop_duplicates()

print("\nDataFrame after removing duplicates:")
print(df)

# 4. Convert data types
# Convert 'Age' and 'Salary' to integers
df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(int)

print("\nDataFrame after converting data types:")
print(df)

# Optionally, save the cleaned DataFrame to a new CSV file
df.to_csv('cleaned_data.csv', index=False)


Original DataFrame:
   Age         Name      City  Gender Occupation  Salary
0   25       meggai  New York  Female   Engineer   60000
1   30        Bobby    London    Male     Doctor  100000
2   22  Charluhasan     Paris    Male     Artist   40000
3   28         Deva     Tokyo    Male    Teacher   55000

DataFrame after handling missing values:
   Age         Name      City  Gender Occupation  Salary
0   25       meggai  New York  Female   Engineer   60000
1   30        Bobby    London    Male     Doctor  100000
2   22  Charluhasan     Paris    Male     Artist   40000
3   28         Deva     Tokyo    Male    Teacher   55000

DataFrame after removing duplicates:
   Age         Name      City  Gender Occupation  Salary
0   25       meggai  New York  Female   Engineer   60000
1   30        Bobby    London    Male     Doctor  100000
2   22  Charluhasan     Paris    Male     Artist   40000
3   28         Deva     Tokyo    Male    Teacher   55000

DataFrame after converting data types:
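   Age         Name      City  Gender Occupation  Salary
0   25       meggai  New York  Female   Engineer   60000
1   30        Bobby    London    Male     Doctor  100000
2   22  Charluhasan     Paris    Male     Artist   40000
3   28         Deva     Tokyo    Male    Teacher   55000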
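
The sample file contains no missing values or duplicate rows, so the cleaning steps above leave it unchanged. A minimal sketch with made-up data that does have gaps (all names and numbers are hypothetical) shows what each step does:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age, one missing Salary, one duplicated row
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dan", "Dan"],
    "Age": [24, np.nan, 22, 28, 28],
    "Salary": [60000, 100000, np.nan, 55000, 55000],
})

# Count missing values per column before cleaning
missing_before = df.isna().sum()

# Fill gaps: median for Age, mean for Salary
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Drop the repeated row, then convert both columns to integers
df = df.drop_duplicates()
df = df.astype({"Age": int, "Salary": int})

print(missing_before)
print(df)
```

Note that `astype(int)` raises an error if any missing values remain, which is why the fill step comes first.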

In [13]:
import pandas as pd
df = pd.read_csv('mydata.csv')

In [12]:
df.describe()

         Age        Salary
count    4.0           4.0
mean   26.25       63750.0
std      3.5  25617.376915
min     22.0       40000.0
25%    24.25       51250.0
50%     26.5       57500.0
75%     28.5       70000.0
max     30.0      100000.0


In [None]:
df.groupby('Gender')['Salary'].mean()
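
A self-contained sketch of the same grouping, with the Gender and Salary values from mydata.csv typed in by hand, plus `agg` to compute several statistics per group in one call:

```python
import pandas as pd

# Gender and Salary columns copied by hand from mydata.csv
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male"],
    "Salary": [60000, 100000, 40000, 55000],
})

# Mean salary for each gender
mean_salary = df.groupby("Gender")["Salary"].mean()
print(mean_salary)

# Several aggregates per group at once
summary = df.groupby("Gender")["Salary"].agg(["count", "mean", "max"])
print(summary)
```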

In [11]:
df['Age'].sum()

105

In [20]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]})

# Concatenate vertically
vertical_concat = pd.concat([df1, df2])

# Concatenate horizontally
horizontal_concat = pd.concat([df1, df2], axis=1)

# Print the results
print("Vertical Concatenation:")
print(vertical_concat)

print("\nHorizontal Concatenation:")
print(horizontal_concat)

Vertical Concatenation:
   A  B
0  1  4
1  2  5
2  3  6
0  4  7
1  5  8
2  6  9

Horizontal Concatenation:
   A  B  A  B
0  1  4  4  7
1  2  5  5  8
2  3  6  6  9
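
In the vertical result the row labels 0, 1, 2 appear twice, because `pd.concat` keeps each frame's original index. If a fresh 0..n-1 index is wanted instead, `ignore_index=True` renumbers the rows. A short sketch with the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [4, 5, 6], "B": [7, 8, 9]})

# Stack the rows and renumber the index from 0
stacked = pd.concat([df1, df2], ignore_index=True)

print(stacked)
```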


In [19]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Merge DataFrames based on 'key' column
merged_df = pd.merge(df1, df2, on='key')

# Print the merged DataFrame
print(merged_df)

  key  value1  value2
0   B       2       4
1   C       3       5
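
Keys A and D are missing from the result because `pd.merge` defaults to an inner join, which keeps only keys present in both frames. The `how` parameter selects other join types. A sketch with the same frames, where `indicator=True` adds a `_merge` column recording where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["A", "B", "C"], "value1": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["B", "C", "D"], "value2": [4, 5, 6]})

# Outer join: keep every key from either frame, fill gaps with NaN
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)

# Left join: keep every key from df1 only
left = pd.merge(df1, df2, on="key", how="left")

print(outer)
print(left)
```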


In [18]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['B', 'C', 'D'])

# Join DataFrames based on index
joined_df = df1.join(df2, how='inner')  # Example using inner join

# Print the joined DataFrame
print(joined_df)

   value1  value2
B       2       4
C       3       5


Pandas offers a suite of powerful tools and features that make it indispensable for data science professionals. Here is a concise breakdown of its advantages:

### 1. Specialized Data Structures
- **DataFrames**: Two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
- **Series**: One-dimensional labeled array capable of holding any data type (integers, strings, floating points, etc.).
- **Indexing and Slicing**: Advanced methods for indexing, slicing, and subsetting data that are more powerful and flexible compared to basic Python structures like lists or dictionaries.

### 2. Data Cleaning and Transformation
- **Handling Missing Data**: Functions such as `fillna()`, `dropna()`, and `isna()` help manage and impute missing values.
- **Removing Duplicates**: The `drop_duplicates()` method eliminates duplicate rows.
- **Data Type Conversion**: Methods like `astype()` allow conversion between different data types.

### 3. Data Analysis
- **Descriptive Statistics**: Functions like `describe()`, `mean()`, `median()`, `std()`, etc., provide quick insights into the data distribution and characteristics.
- **Aggregation and Grouping**: The `groupby()` function allows grouping data and applying aggregate functions like `sum()`, `count()`, and `mean()` to each group.
- **Filtering and Querying**: Data can be filtered and queried using boolean indexing and the `query()` method.
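
Boolean indexing and `query()` express the same filter in two ways. A minimal sketch, assuming the small Name/Age/City table used earlier in this notebook:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "City": ["New York", "London", "Paris", "Tokyo"],
})

# Boolean indexing: combine conditions with & and wrap each in parentheses
by_mask = df[(df["Age"] > 24) & (df["City"] != "London")]

# query(): the same filter written as a string expression
by_query = df.query("Age > 24 and City != 'London'")

print(by_query)
```

Both forms select the same rows. `query()` tends to read more easily when several conditions are combined.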

### 4. Data Visualization
- **Integration with Visualization Libraries**: Seamless integration with libraries like Matplotlib and Seaborn for creating plots and visualizations directly from Pandas DataFrames and Series.
- **Built-in Plotting**: Pandas provides basic plotting capabilities through its `plot()` method, which can be handy for quick visualizations.

### 5. Efficiency
- **Built on NumPy**: Pandas is built on top of NumPy, which provides efficient array operations and vectorized computations.
- **Optimized Performance**: Vectorized operations on whole columns are typically much faster, and numeric columns more compact in memory, than equivalent loops over Python lists or dictionaries.
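
As an illustration of what "vectorized" means here, the sketch below applies the same arithmetic once with a Python loop and once as a single Series operation. The values and the 1.2 multiplier are made up, and no timing is shown because speedups depend on data size and hardware.

```python
import pandas as pd

# Made-up values for illustration
prices = pd.Series([10.0, 20.0, 30.0])

# Element by element, in a Python-level loop
looped = [p * 1.2 for p in prices]

# Vectorized: one expression applied to the whole column by NumPy
vectorized = prices * 1.2

print(looped)
print(vectorized)
```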

### 6. Community and Resources
- **Active Community**: Large community support through forums, Q&A sites, and contributions.
- **Extensive Documentation**: Comprehensive and well-maintained documentation and tutorials.

In summary, Pandas enhances productivity and efficiency in data science workflows by offering sophisticated data structures, powerful data manipulation capabilities, and seamless integration with visualization tools. Its performance and support make it a go-to tool for handling and analyzing complex datasets.