### **Data Manipulation with Pandas**

**Pandas** is a powerful and flexible open-source library in Python that provides data structures and data analysis tools. It is widely used for data manipulation, data cleaning, and data exploration, and is designed to work with structured data like tables. The two main data structures in Pandas are Series and DataFrames.

#### 1. **Pandas Series**

A Series is a one-dimensional array-like object in Pandas that can hold any type of data such as integers, floats, strings, or even other Python objects. Each element in a Series has a corresponding index label, which allows for easy access to data.
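
As a small sketch of the point about element types, the example below builds two Series and checks the `dtype` Pandas infers for each. The values and variable names are made up purely for illustration.

```python
import pandas as pd

# Hypothetical values, chosen only to show different element types
floats = pd.Series([1.5, 2.5, 3.5])
mixed = pd.Series([1, 'two', 3.0])

print(floats.dtype)  # float64
print(mixed.dtype)   # object (used when the elements have mixed Python types)
```

A Series whose elements all share one numeric type gets a compact numeric dtype, while mixed content falls back to the general `object` dtype.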

##### **Creating a Pandas Series**

You can create a Series using the `pd.Series()` function, where `pd` is the conventional alias for Pandas:


In [1]:
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40]
height = pd.Series(data)
print(height)

0    10
1    20
2    30
3    40
dtype: int64


•	The left column is the index, and the right column is the data.

•	By default, Pandas assigns integer indexes starting from 0.


##### **Custom Index**

You can also specify custom indices for a Series:


In [2]:
y = pd.Series([100, 200, 300])
print(y)

0    100
1    200
2    300
dtype: int64


In [3]:
s = pd.Series([100, 200, 300], index = ('a', 'b', 'c'))
print(s)

a    100
b    200
c    300
dtype: int64


##### **Accessing Data in a Series**

You can access elements in a Series with square brackets and the element's index label. For `y` the labels are the default integers, and for `s` they are the custom labels:


In [4]:
# Accessing elements by index
print(y[1])  # Output: 200
print(s['c'])  # Output: 300

200
300
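
Square brackets look values up by label. To be explicit about whether you mean a label or a position, a Series also offers `.loc` (label-based) and `.iloc` (position-based). A minimal sketch, reusing the values of `s` from above:

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

print(s.loc['b'])      # label-based lookup: 200
print(s.iloc[1])       # position-based lookup: 200

# Label slices include the end label; position slices exclude the end position
print(s.loc['a':'b'])  # rows 'a' and 'b'
print(s.iloc[0:2])     # rows at positions 0 and 1
```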


#### 2. Pandas DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. Think of it as a table or a spreadsheet in Python, where rows represent records and columns represent attributes.

##### **Creating a Pandas DataFrame**

You can create a DataFrame in many ways, including from a dictionary, list of lists, or a NumPy array.

##### **1.	From a dictionary:**


In [5]:
data = {
    'Name': ['Abiodun', 'John', 'Uche'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)

df

      Name  Age  Salary
0  Abiodun   25   50000
1     John   30   60000
2     Uche   35   70000


#### df.to_string() function

`df.to_string()` converts a DataFrame into a string representation, which can then be printed to the console, written to a file, or used in other text-based operations. Passing `index=False` omits the row index from the output.

In [6]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Abiodun', 'John', 'Uche'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)

# Convert the DataFrame to a string
string_representation = df.to_string(index=False)

# Print the string representation
print(string_representation)

   Name  Age  Salary
Abiodun   25   50000
   John   30   60000
   Uche   35   70000


##### **2.	From a list of lists:**

In [7]:
import pandas as pd
data = [
    ['Ope', 25, 50000],
    ['John', 30, 60000],
    ['Uche', 35, 70000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)


   Name  Age  Salary
0   Ope   25   50000
1  John   30   60000
2  Uche   35   70000
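
A DataFrame can also be built from a NumPy array, the third option mentioned earlier. The sketch below assumes a purely numeric array (NumPy arrays hold a single data type) and supplies the column names separately. The name `df_np` is illustrative.

```python
import numpy as np
import pandas as pd

# Each row is (Age, Salary); the values mirror the earlier examples
arr = np.array([[25, 50000],
                [30, 60000],
                [35, 70000]])

# Column names are passed separately because the array itself has none
df_np = pd.DataFrame(arr, columns=['Age', 'Salary'])
print(df_np)
```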


##### **Accessing Data in a DataFrame**

You can access rows, columns, and individual elements using various methods.

##### Accessing Columns: 

You can access a column like you would access a dictionary value or as an attribute.


In [8]:
# Accessing a single column
print(df['Name'])  # Output: Pandas Series

# Accessing multiple columns
print(df[['Salary', 'Age']])  # Output: DataFrame with selected columns


0     Ope
1    John
2    Uche
Name: Name, dtype: object
   Salary  Age
0   50000   25
1   60000   30
2   70000   35
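
Attribute-style access is not shown in the cell above, so here is a short sketch of it. It only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The name `df_demo` is illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                        'Age': [25, 30, 35]})

names_by_key = df_demo['Name']   # dictionary-style access
names_by_attr = df_demo.Name     # attribute-style access

# Both forms return the same Series
print(names_by_attr.equals(names_by_key))  # True
```

Dictionary-style access is the safer default because it works for any column name, including names with spaces.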


##### Accessing Rows:

Use `loc` or `iloc` for row access.

–	`loc`: Access by index label.

–	`iloc`: Access by integer position.


In [9]:
# Access by integer position (first row)
print(df.iloc[0])  # Output: Pandas Series

# Access by index label (the row labeled 0)
print(df.loc[0])  # Output: Pandas Series

# Slicing rows
print(df.iloc[0:2])  # Output: DataFrame of first two rows


Name        Ope
Age          25
Salary    50000
Name: 0, dtype: object
Name        Ope
Age          25
Salary    50000
Name: 0, dtype: object
   Name  Age  Salary
0   Ope   25   50000
1  John   30   60000
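
With the default integer index, `loc[0]` and `iloc[0]` happen to return the same row, which can hide the difference between them. The sketch below uses hypothetical string labels (`'x'`, `'y'`, `'z'`) to make the distinction visible.

```python
import pandas as pd

df_lab = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                       'Age': [25, 30, 35]},
                      index=['x', 'y', 'z'])  # hypothetical row labels

print(df_lab.loc['y'])         # row whose label is 'y'
print(df_lab.iloc[1])          # second row by position (the same row here)

# Both accessors also take a column selector to reach a single value
print(df_lab.loc['y', 'Age'])  # 30
print(df_lab.iloc[1, 1])       # 30
```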


#### 3. Basic Operations with DataFrames
Pandas provides a variety of functions to perform basic operations on DataFrames, such as selecting, filtering, adding columns, and modifying the data.

##### **Adding a New Column**

You can add a new column by assigning values to a new column name:


In [10]:
df['City'] = ['Ibadan', 'Lagos', 'Benin']
print(df)

   Name  Age  Salary    City
0   Ope   25   50000  Ibadan
1  John   30   60000   Lagos
2  Uche   35   70000   Benin
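
A new column can also be computed from existing columns, or filled with a single value that Pandas repeats for every row. The column names `SalaryAfterRaise` and `Reviewed` and the flat raise of 5000 are assumptions made for this sketch only.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                        'Salary': [50000, 60000, 70000]})

# Derived column: element-wise arithmetic on an existing column
df_demo['SalaryAfterRaise'] = df_demo['Salary'] + 5000

# Scalar assignment: the same value is repeated for every row
df_demo['Reviewed'] = False

print(df_demo)
```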


##### **Filtering Data**

You can filter rows based on conditions:


In [11]:
# Filter rows where Salary is greater than 55000
ab_55 = df['Salary']
above_55 = df[ab_55 > 55000]
print(above_55)


   Name  Age  Salary   City
1  John   30   60000  Lagos
2  Uche   35   70000  Benin
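
Conditions can be combined as well. A minimal sketch, assuming the same three records as above: wrap each condition in parentheses and join them with `&` (and) or `|` (or) rather than the Python keywords `and` and `or`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                        'Age': [25, 30, 35],
                        'Salary': [50000, 60000, 70000]})

# Rows where both conditions hold
both = df_demo[(df_demo['Salary'] > 55000) & (df_demo['Age'] < 35)]
print(both)    # only John

# Rows where at least one condition holds
either = df_demo[(df_demo['Age'] < 30) | (df_demo['Salary'] > 65000)]
print(either)  # Ope and Uche
```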


##### **Dropping Columns or Rows**

You can remove columns or rows using the `drop()` method:


In [12]:
# Dropping a column
df = df.drop(columns=['City'])

# Dropping a row
df = df.drop(0)  # Removes the row with index label 0
print(df)


   Name  Age  Salary
1  John   30   60000
2  Uche   35   70000
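
Note that `drop()` returns a new DataFrame and leaves the original untouched, which is why the cell above assigns the result back to `df`. The sketch below illustrates this with illustrative variable names, and also drops several rows at once by passing a list of labels.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                        'City': ['Ibadan', 'Lagos', 'Benin']})

trimmed = df_demo.drop(columns=['City'])
print('City' in df_demo.columns)  # True: the original still has the column
print('City' in trimmed.columns)  # False

# Drop several rows at once by index label
fewer_rows = df_demo.drop(index=[0, 2])
print(fewer_rows.index.tolist())  # [1]
```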


#### 4. Reading/Writing Data Files
Pandas makes it easy to read from and write to different file formats such as CSV, Excel, JSON, and SQL databases.

##### **Reading CSV Files**

You can read data from a CSV file using the `pd.read_csv()` function:


In [13]:
# Reading a CSV file
import pandas as pd
df2 = pd.read_csv('obesity_matadata.csv')
print(df2)


FileNotFoundError: [Errno 2] No such file or directory: 'obesity_matadata.csv'

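#### To import data located outside the directory

For instance, to read a file stored on the desktop, pass the file's full path to `pd.read_csv()` instead of just its name.

In [None]:
# Reading a CSV file
# Replace the file name below with the full path to the file if it is not in the working directory
import pandas as pd
df3 = pd.read_csv('obesity_matadata.csv')
print(df3)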
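
One way to build a full path is with `pathlib`. The sketch below is self-contained: it writes a small CSV into a temporary folder, which stands in for any location outside the working directory, and reads it back through its full path. The file name `example.csv` and its contents are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd

# A temporary folder stands in for a location outside the working directory
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / 'example.csv'  # hypothetical file
    csv_path.write_text('name,score\nAda,90\nBayo,85\n', encoding='utf-8')

    # read_csv accepts a full path, given as a string or a Path object
    df_path = pd.read_csv(csv_path)

print(df_path)
```

For a file on the desktop, a path such as `Path.home() / 'Desktop' / 'your_file.csv'` could be passed in the same way, assuming the desktop folder sits directly inside your home directory.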

In [None]:
df4 = pd.read_csv('students_scores.csv')
print(df4)

In [None]:
d4_new = df4.drop(columns=['age'])
print(d4_new)

In [None]:
d4_new.to_csv('student_scorenew.csv', index=False)

##### **Writing CSV Files**

You can write a DataFrame to a CSV file using the `to_csv()` function:


In [None]:
# Writing to a CSV file
df2.to_csv('df2_data.csv', index=False)  # index=False prevents writing the row indices
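
To see what `index=False` changes without touching the disk, `to_csv()` can be called without a file name, in which case it returns the CSV text as a string. The sketch below compares both variants and reads one back through an in-memory buffer. The names are illustrative.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'Name': ['John', 'Uche'], 'Age': [30, 35]})

with_index = df_demo.to_csv()                # first column holds the row index
without_index = df_demo.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)

# Read the CSV text back, using StringIO in place of a file on disk
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```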


##### **Reading Excel Files**

To read Excel files, you can use `pd.read_excel()`, which needs the `openpyxl` library to be installed for `.xlsx` files:


In [None]:
# Reading from an Excel file
df = pd.read_excel('df2_data.xlsx', sheet_name='Sheet1')
print(df)


FileNotFoundError: [Errno 2] No such file or directory: 'df2_data.xlsx'

In [None]:
df3 = pd.read_excel('my-first-data.xlsx')
print(df3)

##### **Writing Excel Files**

You can write a DataFrame to an Excel file using the `to_excel()` function:

In [None]:
# Writing to an Excel file
df3.to_excel('my-second-data.xlsx', sheet_name='Sheet1', index=False)


#### 5. Other DataFrame Operations
Pandas also provides various useful functions for data cleaning, reshaping, and summarizing data.

##### **Descriptive Statistics:** 

You can use methods like `mean()`, `sum()`, `min()`, `max()`, and `describe()` to summarize data.


In [None]:
# Basic statistics
print(df.describe())  # Summary of numeric columns
print(df['Age'].var())  # Variance of the Age column
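
The individual summary methods can be called on a single column. A minimal sketch, assuming the Age and Salary values used earlier in this section:

```python
import pandas as pd

df_demo = pd.DataFrame({'Age': [25, 30, 35],
                        'Salary': [50000, 60000, 70000]})

print(df_demo['Salary'].mean())  # 60000.0
print(df_demo['Salary'].sum())   # 180000
print(df_demo['Age'].min())      # 25
print(df_demo['Age'].max())      # 35
print(df_demo['Age'].var())      # 25.0 (sample variance, divides by n - 1)
```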


##### **Handling Missing Data:** 

You can detect and handle missing values using methods like `isnull()`, `fillna()`, and `dropna()`.

In [None]:
# Detect missing values
print(df.isnull())

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Drop rows with missing values
df = df.dropna()
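
The cell above depends on a DataFrame loaded from a file, so here is a self-contained sketch with one deliberately missing value (`np.nan`). The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche'],
                        'Age': [25, np.nan, 35]})

# Count missing values per column
print(df_demo.isnull().sum())

# Fill the missing age with the mean of the known ages (30.0)
filled = df_demo.fillna({'Age': df_demo['Age'].mean()})
print(filled)

# Alternatively, drop every row that contains a missing value
dropped = df_demo.dropna()
print(dropped)
```

Filling keeps all three rows, while dropping keeps only the two complete ones. Which is appropriate depends on the analysis.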


#### DataFrame Sorting:

You can sort DataFrame rows by a specific column using `sort_values()`:

In [15]:
# Sort by 'Age' column
df_sorted = df.sort_values(by='Age')
print(df_sorted)

   Name  Age  Salary
1  John   30   60000
2  Uche   35   70000
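
`sort_values()` sorts in ascending order by default. The sketch below, using made-up records, shows descending order with `ascending=False` and sorting by more than one column, where later columns break ties in earlier ones.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ope', 'John', 'Uche', 'Ada'],
                        'Age': [30, 25, 30, 35]})

# Oldest first
by_age_desc = df_demo.sort_values(by='Age', ascending=False)
print(by_age_desc)

# Sort by Age, then by Name among rows with the same Age
by_age_then_name = df_demo.sort_values(by=['Age', 'Name'])
print(by_age_then_name)
```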


##### Arrange in ascending/descending order

Plain Python lists can be arranged with the built-in `sorted()` function, without Pandas:

In [14]:
names = ['Bola','Deola', 'Okoro','Afusat', 'Clement']

# Ascending order
ascending_names = sorted(names)
print("Ascending:", ascending_names)

# Descending order
descending_names = sorted(names, reverse=True)
print("Descending:", descending_names)


Ascending: ['Afusat', 'Bola', 'Clement', 'Deola', 'Okoro']
Descending: ['Okoro', 'Deola', 'Clement', 'Bola', 'Afusat']


In [16]:
import pandas as pd
values = [4, 6, 8, 1, 0, 9]
sorted_values = sorted(values)
sorted_values

[0, 1, 4, 6, 8, 9]

#### Ranking a list

In [17]:
import pandas as pd
# Sample data (assuming it's in a list)
data = [0.444, 0.051, 0.337, 0.705, 0.423, 0.423, 0.453, 0.514, 0.433, 0.458, 0.481, 0.481, 0.500, 0.514, 0.405]

# Create a DataFrame
df = pd.DataFrame({'values': data})

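# Rank the values from highest to lowest; method='dense' gives tied values
# the same rank and leaves no gaps between ranks
df['rank'] = df['values'].rank(method='dense', ascending=False)

# rank() returns floats, so convert the ranks to integers
df['rank'] = df['rank'].astype(int)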

print(df)

    values  rank
0    0.444     7
1    0.051    12
2    0.337    11
3    0.705     1
4    0.423     9
5    0.423     9
6    0.453     6
7    0.514     2
8    0.433     8
9    0.458     5
10   0.481     4
11   0.481     4
12   0.500     3
13   0.514     2
14   0.405    10
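
The `method` argument controls how ties are ranked. As a rough comparison, the sketch below ranks four of the values above (including the tied pair 0.481) with three common methods. The column names are illustrative.

```python
import pandas as pd

scores = pd.Series([0.514, 0.481, 0.481, 0.423])

ranks = pd.DataFrame({
    'values': scores,
    # 'average': tied values share the mean of the ranks they span (2 and 3 -> 2.5)
    'average': scores.rank(method='average', ascending=False),
    # 'min': tied values share the lowest rank, and the next rank is skipped
    'min': scores.rank(method='min', ascending=False),
    # 'dense': tied values share a rank, and no rank is skipped
    'dense': scores.rank(method='dense', ascending=False),
})
print(ranks)
```

With `dense`, the value after the tie is ranked 3. With `min` and `average`, it is ranked 4.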


#### Summary

•	Series: One-dimensional labeled arrays.

•	DataFrames: Two-dimensional tables of data with labeled rows and columns.

•	Basic operations like filtering, adding/removing columns, and modifying data are easy with Pandas.

•	Pandas makes reading from and writing to files (CSV, Excel) very straightforward.

•	Useful methods include describe(), dropna(), fillna(), sort_values(), and more.

•	Pandas is an essential library for data manipulation in Python, making it a cornerstone of data science and machine learning workflows.
