#### Pandas is like a supercharged version of Excel in Python. It lets you load, manipulate, analyze, and visualize data in a way that is both easy and powerful. Pandas is built on top of NumPy, so it inherits much of NumPy's speed and efficiency.

**Core Data Structures**
- Series: A one-dimensional labeled array capable of holding any data type (integer, string, float, etc.).
- DataFrame: A two-dimensional labeled data structure with columns of potentially different types, similar to a table in a database or a spreadsheet.


**Importing pandas**

In [342]:
import pandas as pd

# **Pandas - Series**

A **Series** is a fundamental pandas data structure that represents a one-dimensional array of indexed data. It can hold any type of data: integers, strings, floats, Python objects, and so on. A Series is built on top of a NumPy array and behaves much like one, with additional capabilities such as handling missing data. Its index is also more flexible than the implicit integer positions of a plain NumPy array: the labels can be strings, dates, or other values.

## Creating a Series

In [343]:
s = pd.Series([1, 3, 5, 7, 9])
print(s)

0    1
1    3
2    5
3    7
4    9
dtype: int64


**Key Attributes**

**Values:** The data in the Series, available as `s.values`.

**Index:** The index (labels) of each data point, available as `s.index`.
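
As a small illustration of these two attributes, the sketch below builds a Series with string labels. The name `temps` and its values are made up for the example.

```python
import pandas as pd

# A Series with custom string labels instead of the default 0..n-1 index
# (the name `temps` and its values are hypothetical)
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"])

print(temps.values)   # the underlying data as a NumPy array
print(temps.index)    # the Index object holding the labels
print(temps["tue"])   # label-based lookup of a single value
```

With the default index, the labels are simply the integers 0, 1, 2, and so on.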

## **Common Methods of Series**


**s.describe():** Provides a quick summary of the data.

This method gives a statistical summary of the Series, including count, mean, standard deviation, minimum, maximum, and quartile values.

In [344]:
# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])

# Descriptive statistics
print(s.describe())

count    5.000000
mean     5.000000
std      3.162278
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max      9.000000
dtype: float64


**s.mean():** Computes the mean of the data.

In [345]:
# Mean of the Series
print(s.mean())

5.0


**s.std():** Computes the standard deviation.

In [346]:
# Standard deviation of the Series
print(s.std())

3.1622776601683795


**s.min() and s.max():** Computes the minimum and maximum values.

In [347]:
# Minimum and maximum values
print(s.min())
print(s.max())

1
9


**s.sort_values():** Sorts the Series.

In [348]:
# Sorting the Series
sorted_s = s.sort_values()
print(sorted_s)

0    1
1    3
2    5
3    7
4    9
dtype: int64
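
The Series above was already in ascending order, so sorting leaves it unchanged. A minimal sketch with unsorted data (the values are made up) shows the effect more clearly, including how each value keeps its original index label:

```python
import pandas as pd

# Hypothetical unsorted data
unsorted = pd.Series([7, 1, 9, 3, 5])

ascending = unsorted.sort_values()                  # default: smallest first
descending = unsorted.sort_values(ascending=False)  # largest first

print(ascending)    # the index labels travel with their values
print(descending)
```

`sort_values()` returns a new Series; `unsorted` itself keeps its original order.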


**s.isnull():** Checks for missing values, returns a Series of booleans.

In [349]:
# Checking for missing values
print(s.isnull())

0    False
1    False
2    False
3    False
4    False
dtype: bool


**s.notnull():** Opposite of isnull().

In [350]:
# Checking for non-null values
print(s.notnull())

0    True
1    True
2    True
3    True
4    True
dtype: bool
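
Since the Series used above has no missing values, both checks return the same answer for every entry. The sketch below uses made-up data that does contain `NaN` and shows two common uses of the boolean result: counting missing values and filtering them out.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
readings = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = readings.isnull()                # True where a value is missing
n_missing = mask.sum()                  # True counts as 1, so this counts the missing values
present = readings[readings.notnull()]  # keep only the non-missing entries

print(mask)
print(n_missing)
print(present)
```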


**s.fillna(value):** Fills missing values with a specified value.

In [351]:
import numpy as np

In [352]:
# Create a Series with missing values
s = pd.Series([1, 2, np.nan, 4, np.nan])

# Print the Series
print(s)

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64


In [353]:
# Filling missing values with 9
filled = s.fillna(9)
print(filled)

0    1.0
1    2.0
2    9.0
3    4.0
4    9.0
dtype: float64


**s.dropna():** Returns a new Series with the missing values removed.

In [354]:
# Creating a Series with missing values
s_with_missing = pd.Series([1, 2, None, 4, 5])

# Dropping missing values
dropped_missing = s_with_missing.dropna()
print(dropped_missing)

0    1.0
1    2.0
3    4.0
4    5.0
dtype: float64
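
A common variation is to fill gaps with a statistic of the data rather than a fixed constant. The sketch below, on made-up data, fills with the mean and also shows that `fillna()` and `dropna()` both return new objects and leave the original Series untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
scores = pd.Series([1, 2, np.nan, 4, np.nan])

filled_with_mean = scores.fillna(scores.mean())  # mean() ignores NaN by default
dropped = scores.dropna()                        # new Series without the NaN entries

print(filled_with_mean)
print(dropped)
print(scores.isnull().sum())  # the original still contains its missing values
```

Whether filling or dropping is appropriate depends on the analysis; this only illustrates the mechanics.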


## Aggregation

In [355]:
# Sum of the Series (NaN values are skipped by default)
print(s.sum())

7.0


In [356]:
# Cumulative sum of the Series
print(s)
print(s.cumsum())

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64
0    1.0
1    3.0
2    NaN
3    7.0
4    NaN
dtype: float64


In [357]:
# Aggregating using multiple operations
aggregated = s.aggregate(['sum', 'mean', 'std'])
print(aggregated)

sum     7.000000
mean    2.333333
std     1.527525
dtype: float64
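
The results above come from a Series that contains `NaN`, which pandas skips by default when aggregating. A short sketch of how that behaviour can be controlled, using the same values under a different, made-up name:

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1, 2, np.nan, 4, np.nan])

total_skipping_nan = s_nan.sum()          # NaN entries are ignored by default
total_with_nan = s_nan.sum(skipna=False)  # any NaN makes the result NaN
running_total = s_nan.fillna(0).cumsum()  # treat missing values as 0 before accumulating

print(total_skipping_nan)
print(total_with_nan)
print(running_total)
```

Filling with 0 before `cumsum()` is only one possible choice; it avoids `NaN` in the running total at the cost of treating missing entries as zeros.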


## Creating a DataFrame

In [358]:
data = {
    'Students': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Math': [78, 85, 62, 90, 88, 76],
    'English': [84, 79, 91, 75, 89, 80]
}

df = pd.DataFrame(data)
print(df)

  Students  Math  English
0    Alice    78       84
1      Bob    85       79
2  Charlie    62       91
3    David    90       75
4      Eva    88       89
5    Frank    76       80


**Analyzing Data**
You can easily compute summary statistics, like the mean score for each subject:

In [359]:
average_scores = df[['Math', 'English']].mean()
print("\nAverage Scores:\n", average_scores)


Average Scores:
 Math       79.833333
English    83.000000
dtype: float64


**Data Selection**
You can select specific columns or rows. For example, to get the Math scores:

In [360]:
math_scores = df['Math']
print("Math Scores:\n", math_scores)

Math Scores:
 0    78
1    85
2    62
3    90
4    88
5    76
Name: Math, dtype: int64
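
Rows can be selected as well. The sketch below rebuilds a shortened version of the table under the assumed name `df_scores` and shows selection by position and by condition.

```python
import pandas as pd

# Shortened copy of the scores table; the name `df_scores` is an assumption
df_scores = pd.DataFrame({
    'Students': ['Alice', 'Bob', 'Charlie'],
    'Math': [78, 85, 62],
    'English': [84, 79, 91],
})

first_row = df_scores.iloc[0]                  # a single row by position, returned as a Series
first_two = df_scores.iloc[0:2]                # rows at positions 0 and 1 (end excluded)
high_math = df_scores[df_scores['Math'] > 75]  # rows that satisfy a condition

print(first_row)
print(first_two)
print(high_math)
```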


You can add a new column with the total scores:

In [361]:
df['Total'] = df['Math'] + df['English']
print(df)

  Students  Math  English  Total
0    Alice    78       84    162
1      Bob    85       79    164
2  Charlie    62       91    153
3    David    90       75    165
4      Eva    88       89    177
5    Frank    76       80    156


**`df.head()` displays the first 5 rows of the DataFrame by default. You can pass a different number to head() if you want to see more or fewer rows. For example, df.head(3) would display the first 3 rows.**

In [362]:
# Display the first 5 rows of the DataFrame
print(df.head())

  Students  Math  English  Total
0    Alice    78       84    162
1      Bob    85       79    164
2  Charlie    62       91    153
3    David    90       75    165
4      Eva    88       89    177


# **Importing Dataset**

Importing datasets into pandas is straightforward. It supports various file formats, such as CSV, Excel (xlsx), and JSON, and it can also read from SQL databases.

In [363]:
df = pd.read_csv('Bengaluru_House_Data.csv')

df.head()

,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0
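
To try `read_csv` without the dataset file at hand, you can read CSV text from memory instead. This is a minimal sketch: the CSV content and the name `sample` are made up, loosely modelled on the columns shown above.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """location,size,price
Whitefield,2 BHK,70.0
Sarjapur,4 Bedroom,148.0
Gottigere,2 BHK,40.0
"""

# read_csv accepts file-like objects as well as file paths
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))
print(sample.shape)
```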


In [364]:
df.tail(10)  # Displays the last 10 rows of the DataFrame

,area_type,availability,location,size,society,total_sqft,bath,balcony,price
13310,Super built-up Area,Ready To Move,Rachenahalli,2 BHK,,1050,2.0,2.0,52.71
13311,Plot Area,Ready To Move,Ramamurthy Nagar,7 Bedroom,,1500,9.0,2.0,250.0
13312,Super built-up Area,Ready To Move,Bellandur,2 BHK,,1262,2.0,2.0,47.0
13313,Super built-up Area,Ready To Move,Uttarahalli,3 BHK,Aklia R,1345,2.0,1.0,57.0
13314,Super built-up Area,Ready To Move,Green Glen Layout,3 BHK,SoosePr,1715,3.0,3.0,112.0
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0
13319,Super built-up Area,Ready To Move,Doddathoguru,1 BHK,,550,1.0,1.0,17.0


**df.shape:**
The df.shape attribute of a DataFrame returns a tuple representing the dimensionality of the DataFrame. The first element of the tuple is the number of rows, and the second is the number of columns. This is useful when you need to know how large the dataset is, such as when you are preprocessing data or ensuring that data manipulations have executed correctly.

In [365]:
df.shape  # Outputs: (number of rows, number of columns)

(13320, 9)

**df.columns:**
The df.columns attribute returns an Index object containing the column labels of the DataFrame. Knowing the column names is essential for accessing specific data in the DataFrame, performing analyses, and for data manipulation tasks like sorting, filtering, or applying functions to certain columns.

In [366]:
df.columns  # Lists all the column names in the DataFrame

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

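**Inspecting Data Types:** Each column in a DataFrame has a specific data type. Understanding these types is crucial for proper data manipulation. In the output below, `total_sqft` is stored as `object` rather than as a number, which suggests that some of its entries are not plain numeric values.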

In [367]:
# Display the data types of each column
df.dtypes

area_type        object
availability     object
location         object
size             object
society          object
total_sqft       object
bath            float64
balcony         float64
price           float64
dtype: object
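
A column such as `total_sqft` that holds numbers as text cannot be used in arithmetic directly. One possible approach is `pd.to_numeric` with `errors='coerce'`, sketched here on made-up values rather than on the actual dataset:

```python
import pandas as pd

# Hypothetical text values; the last one cannot be parsed as a number
raw = pd.Series(['1056', '2600', '1330.74', 'not available'])

# errors='coerce' turns unparseable entries into NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print(numeric.isnull().sum())  # how many entries failed to convert
```

Coercion hides the problematic entries behind `NaN`, so it is worth inspecting them before deciding how to handle them.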

**Summary Statistics:** For numerical data, it's useful to get a sense of their central tendency and spread. By default, `df.describe()` summarizes only the numeric columns.

In [368]:
# Display summary statistics for numerical columns
df.describe()

,bath,balcony,price
count,13247.0,12711.0,13320.0
mean,2.69261,1.584376,112.565627
std,1.341458,0.817263,148.971674
min,1.0,0.0,8.0
25%,2.0,1.0,50.0
50%,2.0,2.0,72.0
75%,3.0,2.0,120.0
max,40.0,3.0,3600.0


## Accessing and Filtering

**df.loc:**
The `df.loc` indexer is used for label-based indexing, meaning you access rows and columns by their labels (i.e., index names and column names). It selects a subset of rows and columns from a DataFrame and supports slicing, lists of labels, and boolean filtering.

In [369]:
location = df.loc[:, "location"]
location

0        Electronic City Phase II
1                Chikka Tirupathi
2                     Uttarahalli
3              Lingadheeranahalli
4                        Kothanur
                   ...           
13315                  Whitefield
13316               Richards Town
13317       Raja Rajeshwari Nagar
13318             Padmanabhanagar
13319                Doddathoguru
Name: location, Length: 13320, dtype: object

In [370]:
# Selecting a range of rows and multiple columns by labels
subset = df.loc[10:20, ['area_type', 'location', 'price']]
subset

,area_type,location,price
10,Super built-up Area,Whitefield,70.0
11,Plot Area,Whitefield,295.0
12,Super built-up Area,7th Phase JP Nagar,38.0
13,Built-up Area,Gottigere,40.0
14,Plot Area,Sarjapur,148.0
15,Super built-up Area,Mysore Road,73.5
16,Super built-up Area,Bisuvanahalli,48.0
17,Super built-up Area,Raja Rajeshwari Nagar,60.0
18,Super built-up Area,Ramakrishnappa Layout,290.0
19,Super built-up Area,Manayata Tech Park,48.0
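
One detail worth knowing when slicing: with `.loc`, a label slice includes both endpoints, whereas the position-based `.iloc` excludes the end, like ordinary Python slicing. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4
frame = pd.DataFrame({'price': [10.0, 20.0, 30.0, 40.0, 50.0]})

by_label = frame.loc[1:3, 'price']      # labels 1, 2 and 3: both endpoints included
by_position = frame.iloc[1:3]['price']  # positions 1 and 2: end excluded

print(by_label)
print(by_position)
```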


In [371]:
# Conditional selection using a boolean array
condition = df.loc[df['size'] == '2 BHK']
condition

,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.00
5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.00
12,Super built-up Area,Ready To Move,7th Phase JP Nagar,2 BHK,Shncyes,1000,2.0,1.0,38.00
13,Built-up Area,Ready To Move,Gottigere,2 BHK,,1100,2.0,2.0,40.00
...,...,...,...,...,...,...,...,...,...
13302,Super built-up Area,Ready To Move,Annaiah Reddy Layout,2 BHK,,1075,2.0,2.0,48.00
13304,Super built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,GrrvaGr,1187,2.0,2.0,40.14
13310,Super built-up Area,Ready To Move,Rachenahalli,2 BHK,,1050,2.0,2.0,52.71
13312,Super built-up Area,Ready To Move,Bellandur,2 BHK,,1262,2.0,2.0,47.00


In [372]:
# Combining two conditions with & (each condition needs its own parentheses)
multiple_condition = df.loc[(df['size'] == '2 BHK') & (df['price'] > 50.0)]
multiple_condition

,area_type,availability,location,size,society,total_sqft,bath,balcony,price
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.00
15,Super built-up Area,Ready To Move,Mysore Road,2 BHK,PrntaEn,1175,2.0,2.0,73.50
40,Built-up Area,Ready To Move,Murugeshpalya,2 BHK,Gentson,1296,2.0,,81.00
44,Super built-up Area,19-Sep,Kanakpura Road,2 BHK,Soazak,1330.74,2.0,2.0,91.79
47,Super built-up Area,20-Sep,Whitefield,2 BHK,Goted U,1459,2.0,1.0,94.82
...,...,...,...,...,...,...,...,...,...
13296,Super built-up Area,Ready To Move,Cox Town,2 BHK,,1200,2.0,2.0,140.00
13297,Super built-up Area,Ready To Move,Electronic City,2 BHK,GMown E,1060,2.0,1.0,52.00
13298,Super built-up Area,Ready To Move,Kenchenahalli,2 BHK,AriosPa,1015,2.0,2.0,60.00
13310,Super built-up Area,Ready To Move,Rachenahalli,2 BHK,,1050,2.0,2.0,52.71
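
Besides `&` (and), conditions can be combined with `|` (or), negated with `~`, or expressed as a membership test with `isin`. The sketch below uses a small made-up frame named `homes` rather than the full dataset:

```python
import pandas as pd

# Hypothetical miniature of the housing data
homes = pd.DataFrame({
    'size': ['2 BHK', '3 BHK', '2 BHK', '4 BHK'],
    'price': [39.0, 95.0, 73.5, 400.0],
})

either = homes.loc[(homes['size'] == '2 BHK') | (homes['price'] > 100)]  # OR
not_2bhk = homes.loc[~(homes['size'] == '2 BHK')]                        # NOT
selected = homes.loc[homes['size'].isin(['3 BHK', '4 BHK'])]             # membership test

print(either)
print(not_2bhk)
print(selected)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the `&`, `|`, and `~` operators are used here.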


## Updating Rows and Columns

**df.drop:**
The .drop() method in pandas is used to remove rows or columns from a DataFrame. Its primary purpose is to drop specified labels from rows or columns.

**Parameters:**

**labels:** The row or column labels to drop.

**axis:** Specifies whether the labels refer to rows (axis=0) or columns (axis=1). By default, it's 0 (rows).

**index or columns:** An alternative way to specify the labels to drop, instead of using the labels parameter. It is equivalent to specifying axis=0 (for index) or axis=1 (for columns).

**inplace:** If True, the operation is done in place, meaning it modifies the DataFrame directly and returns None. If False or not specified, it returns a new DataFrame with the specified labels dropped.

In [373]:
# Returns a new DataFrame without the 'area_type' column; df itself is unchanged
df.drop(labels='area_type', axis=1)

,availability,location,size,society,total_sqft,bath,balcony,price
0,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.00
2,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.00
3,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.00
4,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.00
...,...,...,...,...,...,...,...,...
13315,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.00
13316,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.00
13317,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.00
13318,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.00
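
The parameters described above can be compared side by side on a small made-up frame (the name `homes` and its rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the housing data
homes = pd.DataFrame({
    'area_type': ['Plot Area', 'Built-up Area', 'Plot Area'],
    'location': ['Sarjapur', 'Gottigere', 'Whitefield'],
    'price': [148.0, 40.0, 295.0],
})

no_area = homes.drop(columns='area_type')  # same as labels='area_type', axis=1
no_first_rows = homes.drop(index=[0, 1])   # same as labels=[0, 1], axis=0
print(homes.shape)                         # the original is still intact at this point

# With inplace=True the frame itself is modified and the call returns None
result = homes.drop(columns='price', inplace=True)
print(result)
print(homes.columns)
```

Because an in-place call returns `None`, assigning its result back to the DataFrame variable would replace the data with `None`.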


**Direct Assignment:**
Directly assign a value to a specific column or even a cell in a DataFrame.

In [374]:
df.at[0, 'price'] = 40.0  # Sets the price in the first row (index label 0) to 40.0
df.head(5)

,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,40.0
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0
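
`.at` addresses exactly one cell. To update several cells at once, `.loc` with a condition is a common alternative. A minimal sketch on a made-up frame; the cap of 100 and the column name `price_rounded` are arbitrary choices for the example:

```python
import pandas as pd

# Hypothetical miniature of the housing data
homes = pd.DataFrame({
    'location': ['Electronic City Phase II', 'Chikka Tirupathi', 'Uttarahalli'],
    'price': [39.07, 120.0, 62.0],
})

homes.at[0, 'price'] = 40.0                       # one cell, by row label and column label
homes.loc[homes['price'] > 100, 'price'] = 100.0  # every row matching the condition
homes['price_rounded'] = homes['price'].round()   # a whole new column from an existing one

print(homes)
```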


In [375]:
df['new_column'] = 'default value'  # Adds a new column with all entries set to 'default value'
df

,area_type,availability,location,size,society,total_sqft,bath,balcony,price,new_column
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,40.0,default value
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0,default value
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0,default value
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0,default value
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0,default value
...,...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0,default value
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0,default value
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0,default value
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0,default value


`inplace=True`: **Modifies the DataFrame in place, without returning a new DataFrame.**

In [376]:
# Removes 'new_column' from df itself; the call returns None
df.drop(labels='new_column', axis=1, inplace=True)

**Using apply Function:**
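The apply function lets you run a function over your data. Called on a single column (a Series), as in the example below, the function receives one value at a time. Called on a DataFrame, it is applied along an axis, receiving one column or one row at a time.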

In [377]:
df['price_category'] = df['price'].apply(lambda x: 'Costly' if x > 50 else 'Cheaper')
df

,area_type,availability,location,size,society,total_sqft,bath,balcony,price,price_category
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,40.0,Cheaper
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0,Costly
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0,Costly
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0,Costly
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0,Costly
...,...,...,...,...,...,...,...,...,...,...
13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.0,Costly
13316,Super built-up Area,Ready To Move,Richards Town,4 BHK,,3600,5.0,,400.0,Costly
13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.0,Costly
13318,Super built-up Area,18-Jun,Padmanabhanagar,4 BHK,SollyCl,4689,4.0,1.0,488.0,Costly
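
The difference between applying a function to a column and applying it to a whole DataFrame can be sketched on a small made-up frame. The name `homes` and the price-per-bathroom calculation are illustrative assumptions, not part of the dataset analysis.

```python
import pandas as pd

# Hypothetical miniature of the housing data
homes = pd.DataFrame({'bath': [2.0, 5.0, 2.0], 'price': [40.0, 120.0, 62.0]})

# Series.apply: the function receives one value at a time
category = homes['price'].apply(lambda x: 'Costly' if x > 50 else 'Cheaper')

# DataFrame.apply with axis=1: the function receives one row at a time
price_per_bath = homes.apply(lambda row: row['price'] / row['bath'], axis=1)

# DataFrame.apply with the default axis=0: the function receives one column at a time
column_max = homes.apply(lambda col: col.max())

print(category)
print(price_per_bath)
print(column_max)
```

For simple arithmetic like the row-wise division, the vectorized form `homes['price'] / homes['bath']` gives the same result and is usually faster; `apply` is most useful when the logic cannot be written as a column operation.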
