### What is pandas
Pandas is an open-source Python library that provides powerful and easy-to-use data analysis tools. It is built on top of the NumPy library and allows users to work with structured data in a way that is similar to working with spreadsheets or SQL tables.
Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type. A DataFrame is a two-dimensional table-like data structure that can hold multiple Series. Pandas also provides functionality for data manipulation, data cleaning, merging and joining of datasets, time series analysis, and more.
Pandas is widely used in data science, machine learning, finance, and other domains where data analysis and manipulation are important.

In [None]:
Comma-separated values (CSV)

XLSX

ZIP

Plain Text (txt)

JSON

XML

HTML

Images

Hierarchical Data Format

PDF

DOCX

MP3

MP4

SQL

#### Why we use pandas in data science
Pandas is a popular choice for data science because it provides a powerful set of tools for data analysis and manipulation in Python. Here are some of the key reasons why pandas is widely used in data science:            
1. Easy handling of tabular data: Pandas makes it easy to work with structured data such as tables or spreadsheets, which is common in data science.            

2. Powerful data manipulation tools: Pandas provides many tools for manipulating data such as filtering, sorting, grouping, merging, and reshaping. These tools allow data scientists to perform complex data transformations quickly and easily.                

3. Integration with other data science libraries: Pandas integrates well with other popular Python libraries used in data science such as NumPy, Matplotlib, and Scikit-learn. This makes it easy to use pandas alongside these libraries in data analysis workflows.        

4. Time series analysis: Pandas has excellent support for time series analysis, including tools for working with dates and times, resampling, and time zone handling.                   

5. Missing data handling: Pandas provides flexible tools for handling missing data, including methods for filling in missing data or dropping incomplete rows.      
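
As a rough illustration of the manipulation tools listed above, the sketch below filters, sorts, and groups a small table. The `sales` data and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
sales = pd.DataFrame({
    "city": ["Delhi", "Mohali", "Delhi", "Mohali", "Delhi"],
    "units": [5, 3, 8, 1, 2],
})

# Filtering: keep rows where more than 2 units were sold.
big_orders = sales[sales["units"] > 2]

# Sorting: order rows by units, largest first.
sorted_sales = sales.sort_values("units", ascending=False)

# Grouping: total units per city.
units_per_city = sales.groupby("city")["units"].sum()

print(big_orders)
print(sorted_sales)
print(units_per_city)
```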

### installation of pandas
pip install pandas

In [1]:
import pandas as pd
import numpy as np

### pandas provides two types of data structures
1. Series
2. DataFrame

### 1. series
A Series is a one-dimensional labeled array capable of holding any data type. You can think of a Series as a single column in a spreadsheet.

### creating a series

In [11]:
import pandas as pd
import numpy as np

# An empty Series can be created with pd.Series() (left commented out here)
# ser = pd.Series()
# print(ser)

# Create a Series from a NumPy array
data = np.array(["s", "o", "u", "r", "a", "v"])
ser = pd.Series(data)
ser

0    s
1    o
2    u
3    r
4    a
5    v
dtype: object

In [6]:
#simple array
data=np.array(["s","o","u","r","a","v"])
#providing an index
ser=pd.Series(data,index=[11,12,13,14,15,16])
print(ser)

11    s
12    o
13    u
14    r
15    a
16    v
dtype: object


In [6]:
# A simple dictionary (named "scores" so the built-in dict is not shadowed)
scores = {'lucky': 20, 'sourav': 30, 'atul': 89}
ser = pd.Series(scores)
print(ser)

lucky     20
sourav    30
atul      89
dtype: int64


In [12]:
# Giving a scalar value with an index: the value is repeated for every label
ser=pd.Series('swini',index=[0,1,2,3,4])
print(ser)

0    swini
1    swini
2    swini
3    swini
4    swini
dtype: object


In [22]:
import pandas as pd
import numpy as np
#series with numpy linspace()
ser1=pd.Series(np.linspace(2,33,4))
print(ser1)
# var=np.linspace(2,33,4,dtype="i")

0     2.000000
1    12.333333
2    22.666667
3    33.000000
dtype: float64


In [24]:
import pandas as pd
import numpy as np

# Series with numpy linspace()
ser1 = pd.Series(np.linspace(2, 33, 4).astype(int))
print(ser1)


0     2
1    12
2    22
3    33
dtype: int32


In [24]:
for x in 'abcdefg':
    print(x)

a
b
c
d
e
f
g


In [4]:
# Creating a Series from a range, with an index built by a list comprehension
import numpy as np
import pandas as pd
ser=pd.Series(range(1,20,3),index=[x for x in 'abcdefg'])
print(ser)

a     1
b     4
c     7
d    10
e    13
f    16
g    19
dtype: int64


In [5]:
#creating a series using range function
ser=pd.Series(range(10))
print(ser)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


In [4]:
for i in range(1,10):
    print(i)


1
2
3
4
5
6
7
8
9


### Accessing  elements of Series

There are two ways to access the elements of a Series:
1. Accessing an element by its position
2. Accessing an element by its label (index)
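
The two access styles can be made explicit with the `.iloc` (position) and `.loc` (label) indexers. The sketch below uses an invented Series whose integer labels differ from its positions, which is the case where plain `[]` is easiest to misread.

```python
import pandas as pd

# A Series whose integer labels (11, 12, 13) differ from its positions (0, 1, 2).
ser = pd.Series([10, 20, 30], index=[11, 12, 13])

# .iloc always works by position.
first_by_position = ser.iloc[0]

# .loc always works by label.
value_by_label = ser.loc[12]

print(first_by_position, value_by_label)
```

With the default index (0, 1, 2, ...), positions and labels coincide, so both styles return the same element.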

In [25]:
import pandas as pd

data = [10, 20, 30, 40, 50]
ser = pd.Series(data)
# print(ser)

# Access element at position 2 using indexing operator []
element = ser[2]
print(element)
elements=ser[3]
print(elements)


30
40


In [10]:
l=[7,8,9,6,7]
#    0 1  2 3 4
# print(l[start:end:step])
print(l[0:3])

[7, 8, 9]


In [11]:
data = [10, 20, 30, 40, 50]
ser = pd.Series(data)

# Retrieve the elements at positions 3 and 4 (the last two elements)
element = ser[3:5]
print(element)


3    40
4    50
dtype: int64


In [13]:
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
# Access element with label 'd' using indexing operator []
element = series['d']
print(element)

a    10
b    20
c    30
d    40
e    50
dtype: int64
40


In [33]:
# making data frame using csv file
import pandas as pd
df=pd.read_csv("data.csv")
# print(df.head())
ser=pd.Series(df['Industry_code_NZSIOC'])
print(ser)
data1=ser.head()
print(data1)

0        99999
1        99999
2        99999
3        99999
4        99999
         ...  
41710     ZZ11
41711     ZZ11
41712     ZZ11
41713     ZZ11
41714     ZZ11
Name: Industry_code_NZSIOC, Length: 41715, dtype: object
0    99999
1    99999
2    99999
3    99999
4    99999
Name: Industry_code_NZSIOC, dtype: object


In [39]:
df=pd.read_csv("data.csv")

# df
ser=pd.Series(df['Industry_code_NZSIOC'])
data1=df[3:5]
print(data1)

   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC  \
3  2021                     Level 1                99999       All industries   
4  2021                     Level 1                99999       All industries   

                Units Variable_code         Variable_name  \
3  Dollars (millions)           H07  Non-operating income   
4  Dollars (millions)           H08     Total expenditure   

       Variable_category    Value  \
3  Financial performance   33,020   
4  Financial performance  654,404   

                              Industry_code_ANZSIC06  
3  ANZSIC06 divisions A-S (excluding classes K633...  
4  ANZSIC06 divisions A-S (excluding classes K633...  


In [41]:
# Using the .loc[] indexer (label-based; the end label 9 is included)
print(df.loc[4:9])


   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC  \
4  2021                     Level 1                99999       All industries   
5  2021                     Level 1                99999       All industries   
6  2021                     Level 1                99999       All industries   
7  2021                     Level 1                99999       All industries   
8  2021                     Level 1                99999       All industries   
9  2021                     Level 1                99999       All industries   

                Units Variable_code             Variable_name  \
4  Dollars (millions)           H08         Total expenditure   
5  Dollars (millions)           H09    Interest and donations   
6  Dollars (millions)           H10            Indirect taxes   
7  Dollars (millions)           H11              Depreciation   
8  Dollars (millions)           H12   Salaries and wages paid   
9  Dollars (millions)           H13  Redun

In [18]:
# Using the .iloc[] indexer (position-based; the end position 9 is excluded)
print(df.iloc[4:9])

   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC  \
4  2021                     Level 1                99999       All industries   
5  2021                     Level 1                99999       All industries   
6  2021                     Level 1                99999       All industries   
7  2021                     Level 1                99999       All industries   
8  2021                     Level 1                99999       All industries   

                Units Variable_code            Variable_name  \
4  Dollars (millions)           H08        Total expenditure   
5  Dollars (millions)           H09   Interest and donations   
6  Dollars (millions)           H10           Indirect taxes   
7  Dollars (millions)           H11             Depreciation   
8  Dollars (millions)           H12  Salaries and wages paid   

       Variable_category    Value  \
4  Financial performance  654,404   
5  Financial performance   26,138   
6  Financial perf

## Data frame
A data frame is a two-dimensional tabular data structure that is similar to a spreadsheet or a SQL table. It is one of the core data structures provided by pandas and is widely used for data analysis and manipulation.

The general syntax for creating a data frame with the DataFrame() constructor in pandas is:

pandas.DataFrame(data, index, columns)

Here's a breakdown of the parameters:

1. data: This is the dataset from which the data frame is created. It can be a list, dictionary, scalar value, series, NumPy arrays, etc. The data is organized in a tabular format, with rows representing observations and columns representing variables.

2. index (optional): By default, the rows are labeled with integers from 0 to n-1, where n is the number of rows. You can define the row labels explicitly with the index parameter. It can be a list, array, or any other sequence whose length matches the number of rows.

3. columns (optional): This parameter provides the column names of the data frame. If column names are not defined, pandas assigns default integer names from 0 to n-1, where n is the number of columns. You can pass a list, array, or any other sequence whose length matches the number of columns.
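
A minimal sketch of the three parameters used together is shown below. The row labels `r1` to `r3` and the sample values are made up for this example.

```python
import pandas as pd

# data: a list of rows, each row holding one value per column.
rows = [["a", 10], ["b", 20], ["c", 30]]

# index: one label per row; columns: one name per column.
df_example = pd.DataFrame(rows, index=["r1", "r2", "r3"], columns=["name", "age"])

print(df_example)
print(df_example.shape)  # (number of rows, number of columns)
```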

In [44]:
# Create an empty DataFrame
import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


In [43]:
# Create a DataFrame from a list
letters = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(letters)
print(df)

   0
0  a
1  b
2  c
3  d
4  e


In [14]:
# Create a DataFrame from a dictionary (named "people" so the built-in dict is not shadowed)
people = {'name': ['a', 'b', 'c', 'd', 'e'], 'age': [10, 20, 30, 40, 50], 'city': ["chd", "chd", "mohali", "delhi", "chd"]}
df = pd.DataFrame(people)
print(df)

  name  age    city
0    a   10     chd
1    b   20     chd
2    c   30  mohali
3    d   40   delhi
4    e   50     chd


In [46]:
# zip function: combines two lists into a sequence of tuples
import pandas as pd

age = [10, 20, 30, 40, 50]
name = ['a', 'b', 'c', 'd', 'e']
list_of_tuples = list(zip(name, age))

df = pd.DataFrame(list_of_tuples,index=[11,12,13,14,15],columns=['name', 'age'])
print(df)


   name  age
11    a   10
12    b   20
13    c   30
14    d   40
15    e   50


In [20]:
# Initialize data from a dict of Series; labels missing from a Series become NaN
d={'one': pd.Series([10,20,30,40,50], index=['a', 'b', 'c', 'd', 'e']),
   'two':pd.Series([60,70,80,90,100], index=['f', 'g', 'h', 'i', 'j']),
  'three': pd.Series([10,20,30,40,50], index=['a', 'b', 'c', 'd', 'e']),
   'four':pd.Series([60,70,80,90,100], index=['f', 'g', 'h', 'i', 'j'])
  }
df=pd.DataFrame(d)
print(df)

    one    two  three   four
a  10.0    NaN   10.0    NaN
b  20.0    NaN   20.0    NaN
c  30.0    NaN   30.0    NaN
d  40.0    NaN   40.0    NaN
e  50.0    NaN   50.0    NaN
f   NaN   60.0    NaN   60.0
g   NaN   70.0    NaN   70.0
h   NaN   80.0    NaN   80.0
i   NaN   90.0    NaN   90.0
j   NaN  100.0    NaN  100.0


### dealing with rows and columns

In [19]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'Age': [25, 30, 35, 28, 32, 27, 31, 29, 34, 26],
    'Qualification': ['Bachelor', 'Master', 'PhD', 'Master', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'PhD'],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Berlin', 'Sydney', 'Toronto', 'Mumbai', 'Singapore', 'Dubai'],
    'Country': ['USA', 'UK', 'Japan', 'France', 'Germany', 'Australia', 'Canada', 'India', 'Singapore', 'UAE']
}

df = pd.DataFrame(data)
print(df)
print(df[['Name','Gender','Country']])

      Name  Age Qualification  Gender       City    Country
0    Alice   25      Bachelor  Female   New York        USA
1      Bob   30        Master    Male     London         UK
2  Charlie   35           PhD    Male      Tokyo      Japan
3    David   28        Master    Male      Paris     France
4      Eva   32      Bachelor  Female     Berlin    Germany
5    Frank   27        Master    Male     Sydney  Australia
6    Grace   31           PhD  Female    Toronto     Canada
7    Henry   29      Bachelor    Male     Mumbai      India
8      Ivy   34        Master  Female  Singapore  Singapore
9     Jack   26           PhD    Male      Dubai        UAE
      Name  Gender    Country
0    Alice  Female        USA
1      Bob    Male         UK
2  Charlie    Male      Japan
3    David    Male     France
4      Eva  Female    Germany
5    Frank    Male  Australia
6    Grace  Female     Canada
7    Henry    Male      India
8      Ivy  Female  Singapore
9     Jack    Male        UAE


In [21]:
df

      Name  Age Qualification  Gender       City    Country
0    Alice   25      Bachelor  Female   New York        USA
1      Bob   30        Master    Male     London         UK
2  Charlie   35           PhD    Male      Tokyo      Japan
3    David   28        Master    Male      Paris     France
4      Eva   32      Bachelor  Female     Berlin    Germany
5    Frank   27        Master    Male     Sydney  Australia
6    Grace   31           PhD  Female    Toronto     Canada
7    Henry   29      Bachelor    Male     Mumbai      India
8      Ivy   34        Master  Female  Singapore  Singapore
9     Jack   26           PhD    Male      Dubai        UAE


In [18]:
# Access data from a JSON file
import pandas as pd
import numpy as np
data=pd.read_json("database.json")
df=data["Total population"]
print(df)

0       3401198.0
1       3073734.0
2       3093465.0
3       3111162.0
4       3127264.0
          ...    
879    29774448.0
880    30243172.0
881    30757669.0
882    31298929.0
883           NaN
Name: Total population, Length: 884, dtype: float64


### working with missing data :
Working with missing data in pandas involves handling and managing the absence of values or NaN (Not a Number) values in a data frame. Pandas provides several methods and functions to handle missing data effectively.

Here are some common operations for working with missing data in pandas:

1. Detecting missing values:

isnull() and notnull(): These functions return a Boolean mask indicating which values are missing or not missing, respectively, in the data frame or a specific column.

any() and all(): These functions can be used with isnull() to check if any or all values in a column or data frame are missing.

2. Dropping missing values:

dropna(): This function is used to drop rows or columns that contain missing values. It allows you to specify the axis (rows or columns) and other parameters to control the behavior of dropping.

3. Filling missing values:

fillna(): This method fills missing values with a specified value, such as a constant or the mean, median, or mode of the column. To fill forward or backward (using the previous or next non-missing value), use ffill() or bfill(); the method= argument of fillna() is deprecated in pandas 2.1 and later.

4. Replacing missing values:

replace(): This function can be used to replace specific values, including missing values, with other values of your choice.
 
5. Interpolating missing values:

interpolate(): This function is used to interpolate missing values based on various interpolation methods such as linear, quadratic, cubic, etc. It estimates missing values based on existing values in a column or data frame.

6. Counting missing values:

isnull().sum(): This expression returns the count of missing values in each column of the data frame.

count(): This function returns the count of non-missing values in each column.
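
The parameters that control dropna() can be sketched as follows. The `frame` data is invented for this example, and only three of the available parameters (`how`, `thresh`, `subset`) are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 misses A, row 2 is entirely missing.
frame = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [4.0, 5.0, np.nan],
    "C": [7.0, 8.0, np.nan],
})

# how="all": drop a row only when every value in it is missing.
no_empty_rows = frame.dropna(how="all")

# thresh=3: keep only rows that have at least 3 non-missing values.
complete_rows = frame.dropna(thresh=3)

# subset=["A"]: look only at column A when deciding which rows to drop.
rows_with_a = frame.dropna(subset=["A"])

print(no_empty_rows)
print(complete_rows)
print(rows_with_a)
```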

In [20]:
#Detecting missing values:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [np.nan, 10, 11, 12]}

df = pd.DataFrame(data)
print(df)
# Detect missing values
print(df.isnull())
# print("***************************")
# print("***************************")
# Check if any values are missing in each column
print(df.isnull().any())
# print("***************************")
# print("***************************")
# Check if all values are missing in each column
print(df.isnull().all())

     A    B     C
0  1.0  5.0   NaN
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0  12.0
       A      B      C
0  False  False   True
1  False   True  False
2   True  False  False
3  False  False  False
A    True
B    True
C    True
dtype: bool
A    False
B    False
C    False
dtype: bool


In [21]:
#Dropping missing values

# Drop rows with missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)
print("***************************")
print("***************************")
# Drop columns with missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)

     A    B     C
3  4.0  8.0  12.0
***************************
***************************
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [20]:
#Filling missing values

# Fill missing values with a constant value
df_filled_constant = df.fillna(8)
print(df)
print(df_filled_constant)
print("***************************")
print("***************************")
# Fill missing values with the mean of the column
df_filled_mean = df.fillna(df.mean())
print(df_filled_mean)
print("***************************")
print("***************************")
# Fill missing values forward (use the previous non-missing value)
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the equivalent call
# (bfill() fills backward instead)
df_filled_forward = df.ffill()

print(df_filled_forward)


     A    B     C
0  1.0  5.0   NaN
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0  12.0
     A    B     C
0  1.0  5.0   8.0
1  2.0  8.0  10.0
2  8.0  7.0  11.0
3  4.0  8.0  12.0
***************************
***************************
          A         B     C
0  1.000000  5.000000  11.0
1  2.000000  6.666667  10.0
2  2.333333  7.000000  11.0
3  4.000000  8.000000  12.0
***************************
***************************
     A    B     C
0  1.0  5.0   NaN
1  2.0  5.0  10.0
2  2.0  7.0  11.0
3  4.0  8.0  12.0


In [23]:
# Replace missing values
# Replace missing values with a specific value
df_replaced = df.replace(np.nan, -1)
print(df_replaced)


     A    B     C
0  1.0  5.0  -1.0
1  2.0 -1.0  10.0
2 -1.0  7.0  11.0
3  4.0  8.0  12.0


In [24]:
# Interpolate missing values
print(df)
df_interpolated = df.interpolate()
print(df_interpolated)

     A    B     C
0  1.0  5.0   NaN
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0  12.0
     A    B     C
0  1.0  5.0   NaN
1  2.0  6.0  10.0
2  3.0  7.0  11.0
3  4.0  8.0  12.0


In [28]:
#Counting missing values:
# Count missing values in each column using isnull().sum()
missing_count = df.isnull().sum()
print("Count of missing values using isnull().sum():")
print(missing_count)
print("***************************")
print("***************************")
# Count non-missing values in each column using count()
non_missing_count = df.count()
print("\nCount of non-missing values using count():")
print(non_missing_count)

Count of missing values using isnull().sum():
A    1
B    1
C    1
dtype: int64
***************************
***************************

Count of non-missing values using count():
A    3
B    3
C    3
dtype: int64


#### Iterating over rows and columns



In [7]:
print("Iterating over rows using iterrows")
print("***************************")
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}

df = pd.DataFrame(data)
print(df)
# Iterating over rows using iterrows()
for index, row in df.iterrows():
    print(f"Row index: {index}")
    print(f"Row values: {row}\n")

Iterating over rows using iterrows
***************************
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
Row index: 0
Row values: A    1
B    4
C    7
Name: 0, dtype: int64

Row index: 1
Row values: A    2
B    5
C    8
Name: 1, dtype: int64

Row index: 2
Row values: A    3
B    6
C    9
Name: 2, dtype: int64



In [30]:
print("Iterating over columns using items")
print("***************************")

data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}

df = pd.DataFrame(data)

# Iterating over columns using items()
# (iteritems() was removed in pandas 2.0; items() is its replacement)
for column_name, column_data in df.items():
    print(f"Column name: {column_name}")
    print(f"Column values: {column_data.values}\n")

Iterating over columns using items
***************************
Column name: A
Column values: [1 2 3]

Column name: B
Column values: [4 5 6]

Column name: C
Column values: [7 8 9]



### Pandas Dataframe/Series.head() method
The head() method in pandas returns the first n rows of a DataFrame or Series (n defaults to 5). It provides a convenient way to quickly inspect the beginning of a dataset or a subset of it.
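
A small sketch of how the n argument behaves, using an invented Series of the numbers 0 to 9: calling head() without an argument returns five elements, and a negative n returns everything except the last |n| elements.

```python
import pandas as pd

numbers = pd.Series(range(10))

# No argument: the first 5 elements.
default_head = numbers.head()

# Negative n: all elements except the last 2.
all_but_last_two = numbers.head(-2)

print(default_head)
print(all_but_last_two)
```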

In [10]:
#in normal way
data = {'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
        'Age': [25, 30, 35, 28, 32],
        'Country': ['USA', 'Canada', 'UK', 'Australia', 'Germany']}

df = pd.DataFrame(data)
print(df)

# Display the first 2 rows of the DataFrame
print(df.head(2))

    Name  Age    Country
0   John   25        USA
1  Alice   30     Canada
2    Bob   35         UK
3   Jane   28  Australia
4   Mike   32    Germany
    Name  Age Country
0   John   25     USA
1  Alice   30  Canada


In [32]:
# by csv file
data=pd.read_csv("data.csv")
df = pd.DataFrame(data)
print(df.head(3))

   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC   
0  2021                     Level 1                99999       All industries  \
1  2021                     Level 1                99999       All industries   
2  2021                     Level 1                99999       All industries   

                Units Variable_code   
0  Dollars (millions)           H01  \
1  Dollars (millions)           H04   
2  Dollars (millions)           H05   

                                     Variable_name      Variable_category   
0                                     Total income  Financial performance  \
1  Sales, government funding, grants and subsidies  Financial performance   
2                Interest, dividends and donations  Financial performance   

     Value                             Industry_code_ANZSIC06  
0  757,504  ANZSIC06 divisions A-S (excluding classes K633...  
1  674,890  ANZSIC06 divisions A-S (excluding classes K633...  
2   49,593  ANZSI

In [33]:
# by json file
data=pd.read_json("database.json")
df = pd.DataFrame(data)
print(df.head(3))

   Country  Year  Area (square kilometres)  Total population   
0  Albania  2000                   28748.0         3401198.0  \
1  Albania  2001                   28748.0         3073734.0   
2  Albania  2002                   28748.0         3093465.0   

   Population density, pers. per sq. km  Population aged 0-14, male   
0                             118.31077                    564851.0  \
1                             106.91992                    460732.0   
2                             107.60627                    452373.0   

   Population aged 0-14, female  Population aged 15-64, male   
0                      529820.0                    1029446.0  \
1                      436652.0                     963612.0   
2                      427711.0                     977294.0   

   Population aged 15-64, female  Population aged 64+, male  ...   
0                      1086946.0                    82588.0  ...  \
1                       980800.0                   108254.0  ... 

### Dataframe/Series.tail() method
The tail() method in pandas is used to return the last n rows of a DataFrame or Series. It is similar to the head() method, but it displays the end portion of the dataset.
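
The same method works on a Series of scalar values. The sketch below uses made-up names and ages and keeps the last two elements along with their labels.

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 32],
                 index=["John", "Alice", "Bob", "Jane", "Mike"])

# Last 2 elements; the index labels are preserved.
last_two = ages.tail(2)

print(last_two)
```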

In [26]:
# by normal way
data = {'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
        'Age': [25, 30, 35, 28, 32],
        'Country': ['USA', 'Canada', 'UK', 'Australia', 'Germany']}

df = pd.DataFrame(data)

# Display the last 3 rows of the DataFrame
print(df.tail(3))

   Name  Age    Country
2   Bob   35         UK
3  Jane   28  Australia
4  Mike   32    Germany


In [35]:
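# data is a dict of lists, so this Series has one element per key,
# and each element is the whole list stored under that key
series = pd.Series(data)

# Display the last 3 elements of the Series (here, all three keys)
print(series.tail(3))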

Name              [John, Alice, Bob, Jane, Mike]
Age                         [25, 30, 35, 28, 32]
Country    [USA, Canada, UK, Australia, Germany]
dtype: object


In [13]:
#by json file
data=pd.read_json("database.json")
# print(data)
df = pd.DataFrame(data)
print(df.tail(3))

        Country  Year  Area (square kilometres)  Total population  \
881  Uzbekistan  2014                  447400.0        30757669.0   
882  Uzbekistan  2015                  447400.0        31298929.0   
883  Uzbekistan  2016                  447400.0               NaN   

     Population density, pers. per sq. km  Population aged 0-14, male  \
881                              68.74758                   4465526.0   
882                              69.95737                   4567154.0   
883                                   NaN                         NaN   

     Population aged 0-14, female  Population aged 15-64, male  \
881                     4201156.0                   10410214.0   
882                     4286838.0                   10566923.0   
883                           NaN                          NaN   

     Population aged 15-64, female  Population aged 64+, male  ...  \
881                     10451289.0                   539106.0  ...   
882                     1

In [14]:
# by csv file
data=pd.read_csv("data.csv")
df = pd.DataFrame(data)
print(df.tail(3))

       Year Industry_aggregation_NZSIOC Industry_code_NZSIOC  \
41712  2013                     Level 3                 ZZ11   
41713  2013                     Level 3                 ZZ11   
41714  2013                     Level 3                 ZZ11   

             Industry_name_NZSIOC       Units Variable_code  \
41712  Food product manufacturing  Percentage           H39   
41713  Food product manufacturing  Percentage           H40   
41714  Food product manufacturing  Percentage           H41   

                Variable_name Variable_category Value  \
41712        Return on equity  Financial ratios    12   
41713  Return on total assets  Financial ratios     5   
41714   Liabilities structure  Financial ratios    46   

                                  Industry_code_ANZSIC06  
41712  ANZSIC06 groups C111, C112, C113, C114, C115, ...  
41713  ANZSIC06 groups C111, C112, C113, C114, C115, ...  
41714  ANZSIC06 groups C111, C112, C113, C114, C115, ...  


### Pandas Dataframe.to_numpy()
The to_numpy() method in pandas converts a DataFrame to a NumPy array containing the values from the DataFrame. The row and column labels are not included in the result.
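
The dtype of the resulting array depends on the columns. As a sketch with invented data: purely numeric columns give a numeric array, while mixing text and numbers falls back to the generic object dtype.

```python
import pandas as pd

# All-numeric columns (int and float) share a common numeric dtype.
numeric = pd.DataFrame({"x": [1, 2], "y": [3.5, 4.5]})

# Text and numbers together have no common numeric dtype.
mixed = pd.DataFrame({"name": ["John", "Alice"], "age": [25, 30]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

print(numeric_array.dtype)
print(mixed_array.dtype)
```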

In [15]:
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'Country': ['USA', 'Canada', 'UK']}

df = pd.DataFrame(data)

# Convert DataFrame to NumPy array
numpy_array = df.to_numpy()

print(numpy_array)

[['John' 25 'USA']
 ['Alice' 30 'Canada']
 ['Bob' 35 'UK']]


In [16]:
#by csv file
data=pd.read_csv("data.csv")
df = pd.DataFrame(data)
numpy_array = df.to_numpy()
print(numpy_array)

[[2021 'Level 1' '99999' ... 'Financial performance' '757,504'
  'ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)']
 [2021 'Level 1' '99999' ... 'Financial performance' '674,890'
  'ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)']
 [2021 'Level 1' '99999' ... 'Financial performance' '49,593'
  'ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)']
 ...
 [2013 'Level 3' 'ZZ11' ... 'Financial ratios' '12'
  'ANZSIC06 groups C111, C112, C113, C114, C115, C116, C117, C118, and C119']
 [2013 'Level 3' 'ZZ11' ... 'Financial ratios' '5'
  'ANZSIC06 groups C111, C112, C113, C114, C115, C116, C117, C118, and C119']
 [2013 'Level 3' 'ZZ11' ... 'Financial ratios' '46'
  'ANZSIC06 groups C111, C112, C113, C114, C115, C116, C117, C118, and C119']]


In [17]:
# by json file
data=pd.read_json("database.json")
df = pd.DataFrame(data)
numpy_array = df.to_numpy()

print(numpy_array)

[['Albania' 2000 28748.0 ... 336.0 nan 400.0]
 ['Albania' 2001 28748.0 ... 250.0 nan 409.0]
 ['Albania' 2002 28748.0 ... 228.0 nan 428.0]
 ...
 ['Uzbekistan' 2014 447400.0 ... nan nan nan]
 ['Uzbekistan' 2015 447400.0 ... nan nan nan]
 ['Uzbekistan' 2016 447400.0 ... nan nan nan]]


#### Pandas Dataframe.describe() method
The describe() method in pandas generates descriptive statistics of a DataFrame. It summarizes the central tendency, dispersion, and shape of the distribution of each column. By default, only numerical columns are included, so a text column such as Name is left out of the result.
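
Non-numeric columns can be included with the `include` parameter. The sketch below, on an invented table, compares the default call with `include="all"`, which adds rows such as unique, top, and freq for text columns.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, 30, 35],
})

# Default: only the numeric column Age is summarized.
numeric_summary = people.describe()

# include="all": text columns are summarized as well.
full_summary = people.describe(include="all")

print(numeric_summary)
print(full_summary)
```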


In [41]:
data = {'Name': ['John', 'Alice', 'Bob', 'Jane', 'Mike'],
        'Age': [25, 30, 35, 28, 32],
        'Salary': [50000, 60000, 70000, 55000, 65000]}

df = pd.DataFrame(data)

# Generate descriptive statistics of the DataFrame
description = df.describe()

print(description)

             Age       Salary
count   5.000000      5.00000
mean   30.000000  60000.00000
std     3.807887   7905.69415
min    25.000000  50000.00000
25%    28.000000  55000.00000
50%    30.000000  60000.00000
75%    32.000000  65000.00000
max    35.000000  70000.00000


In [18]:
# by json file
data=pd.read_json("database.json")
df = pd.DataFrame(data)
description = df.describe()
print(description)

              Year  Area (square kilometres)  Total population  \
count   884.000000              8.810000e+02      7.860000e+02   
mean   2008.000000              8.597709e+05      2.368150e+07   
std       4.901753              2.795082e+06      4.801203e+07   
min    2000.000000              3.160000e+02      2.812050e+05   
25%    2004.000000              4.152600e+04      3.615421e+06   
50%    2008.000000              8.836100e+04      7.501732e+06   
75%    2012.000000              3.237590e+05      1.996501e+07   
max    2016.000000              1.707540e+07      3.214188e+08   

       Population density, pers. per sq. km  Population aged 0-14, male  \
count                            786.000000                7.850000e+02   
mean                             122.029854                2.259789e+06   
std                              183.847053                4.782705e+06   
min                                2.730150                3.144650e+04   
25%                           

In [19]:
#by csv file
data=pd.read_csv("data.csv")
df = pd.DataFrame(data)
description = df.describe()

print(description)

              Year
count  41715.00000
mean    2017.00000
std        2.58202
min     2013.00000
25%     2015.00000
50%     2017.00000
75%     2019.00000
max     2021.00000


### Pandas Series.as_matrix()
The Series.as_matrix() method was used in older versions of pandas to convert a Series into a NumPy array. It was deprecated in pandas 0.23 and removed in pandas 1.0; use Series.to_numpy() (or the .values attribute) instead.

The result is a one-dimensional NumPy array that contains the same data as the Series. This can be useful for performing numerical computations with the tools available in NumPy.

In [44]:
#create series
sr = pd.Series([1, 2, 3, 4, 5])
#create index
index_=['John', 'Alice', 'Bob', 'Jane', 'Mike']
sr.index=index_
print(sr)

John     1
Alice    2
Bob      3
Jane     4
Mike     5
dtype: int64


### Converting a pandas Series to a NumPy array

In [45]:
s = pd.Series([1, 2, 3, 4, 5])

# Convert the series to a NumPy array using to_numpy()
arr = s.to_numpy()

# Print the NumPy array
print(arr)
print("************")
print(list(arr))
print("************")
print(type(s.to_numpy()))
print("************")
print(s.to_numpy(dtype='float32'))
print("************")

[1 2 3 4 5]
************
[1, 2, 3, 4, 5]
************
<class 'numpy.ndarray'>
************
[1. 2. 3. 4. 5.]
************


In [20]:
# by csv file
data=pd.read_csv("data.csv")
data.dropna(inplace=True)
#creating series from the Year
arr=pd.Series(data['Year'].head())
print(arr)

0    2021
1    2021
2    2021
3    2021
4    2021
Name: Year, dtype: int64


In [21]:
# by json file
data=pd.read_json("database.json")
data.dropna(inplace=True)
#creating series from the Country
arr=pd.Series(data['Country'].head())
print(arr)

125    Bulgaria
126    Bulgaria
127    Bulgaria
163     Croatia
164     Croatia
Name: Country, dtype: object


### Dealing with Rows and Columns in Pandas DataFrame


Pandas is a powerful library in Python for data manipulation and analysis. It provides a DataFrame object, which is a two-dimensional labeled data structure similar to a table or spreadsheet. In Pandas, you can deal with rows and columns in a DataFrame using various operations.

#### Dealing with columns
Dealing with columns in a Pandas DataFrame involves various operations such as accessing columns, renaming columns, adding new columns, deleting columns, and performing calculations on columns. 

###### Accessing Columns:
To access a specific column in a DataFrame, you can use the indexing operator [] with the column name as a string. For example, if you have a DataFrame called df and you want to access the column "column_name", you can use df["column_name"]. This will return a Series object containing the values of that column.

In [28]:
#in simple way 
import pandas as pd

# Create a DataFrame
data = {
    "Name": ["John", "Emma", "Ryan"],
    "Age": [25, 28, 32],
    "City": ["New York", "Paris", "London"]
}
df = pd.DataFrame(data)

# Access the "Age" column
age_column = df["Age"]
print(age_column)


0    25
1    28
2    32
Name: Age, dtype: int64


In [49]:
# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# Access a specific column
column_name = df["Year"]
print(column_name)

0        2021
1        2021
2        2021
3        2021
4        2021
         ... 
41710    2013
41711    2013
41712    2013
41713    2013
41714    2013
Name: Year, Length: 41715, dtype: int64


In [22]:
# Read the JSON file into a DataFrame
df = pd.read_json("database.json")

# Access a specific column
column_name = df["Country"]
print(column_name)

0         Albania
1         Albania
2         Albania
3         Albania
4         Albania
          ...    
879    Uzbekistan
880    Uzbekistan
881    Uzbekistan
882    Uzbekistan
883    Uzbekistan
Name: Country, Length: 884, dtype: object


#### Renaming Columns
To rename a column in a DataFrame, you can use the rename() method. It allows you to provide a dictionary where the keys represent the current column names, and the values represent the new column names. For example, df = df.rename(columns={"old_name": "new_name"}) will rename the column "old_name" to "new_name" in the DataFrame.

In [51]:
#in simple way 
# Create a DataFrame
data = {
    "Name": ["John", "Emma", "Ryan"],
    "Age": [25, 28, 32],
    "City": ["New York", "Paris", "London"]
}
df = pd.DataFrame(data)
# Rename the "Name" column to "Full Name"
df = df.rename(columns={"Name": "Full Name"})
print(df)

  Full Name  Age      City
0      John   25  New York
1      Emma   28     Paris
2      Ryan   32    London


In [23]:
# Read the CSV file into a DataFrame
df = pd.read_csv("data.csv")
# Rename the "Year" column to "saal"
df = df.rename(columns={"Year": "saal"})
print(df)

       saal Industry_aggregation_NZSIOC Industry_code_NZSIOC  \
0      2021                     Level 1                99999   
1      2021                     Level 1                99999   
2      2021                     Level 1                99999   
3      2021                     Level 1                99999   
4      2021                     Level 1                99999   
...     ...                         ...                  ...   
41710  2013                     Level 3                 ZZ11   
41711  2013                     Level 3                 ZZ11   
41712  2013                     Level 3                 ZZ11   
41713  2013                     Level 3                 ZZ11   
41714  2013                     Level 3                 ZZ11   

             Industry_name_NZSIOC               Units Variable_code  \
0                  All industries  Dollars (millions)           H01   
1                  All industries  Dollars (millions)           H04   
2                 

In [53]:
# Read the JSON file into a DataFrame
df = pd.read_json("database.json")
# Rename the "Country" column to "desh"
df = df.rename(columns={"Country": "desh"})
print(df)

           desh  Year  Area (square kilometres)  Total population   
0       Albania  2000                   28748.0         3401198.0  \
1       Albania  2001                   28748.0         3073734.0   
2       Albania  2002                   28748.0         3093465.0   
3       Albania  2003                   28748.0         3111162.0   
4       Albania  2004                   28748.0         3127264.0   
..          ...   ...                       ...               ...   
879  Uzbekistan  2012                  447400.0        29774448.0   
880  Uzbekistan  2013                  447400.0        30243172.0   
881  Uzbekistan  2014                  447400.0        30757669.0   
882  Uzbekistan  2015                  447400.0        31298929.0   
883  Uzbekistan  2016                  447400.0               NaN   

     Population density, pers. per sq. km  Population aged 0-14, male   
0                               118.31077                    564851.0  \
1                        

#### Adding New Columns:
To add a new column to a DataFrame, you can assign a new Series or an array to a new column name. For example, df["new_column"] = [1, 2, 3, 4] will add a new column called "new_column" to the DataFrame and populate it with the provided values. A list or array must have exactly as many elements as the DataFrame has rows, otherwise pandas raises a ValueError.

In [54]:
#in simple way 
# Create a DataFrame
data = {
    "Name": ["John", "Emma", "Ryan"],
    "Age": [25, 28, 32],
    "City": ["New York", "Paris", "London"]
}
df = pd.DataFrame(data)
# Add a new column "Salary"
df["Salary"] = [5000, 6000, 7000]
print(df)

   Name  Age      City  Salary
0  John   25  New York    5000
1  Emma   28     Paris    6000
2  Ryan   32    London    7000


### Deleting Columns
To delete a column from a DataFrame, you can use the drop() function. For example, df = df.drop("column_name", axis=1) will remove the column named "column_name" from the DataFrame. The axis=1 parameter specifies that you are dropping a column.

In [55]:
# Delete the "City" column
df = df.drop("City", axis=1)
print(df)

   Name  Age  Salary
0  John   25    5000
1  Emma   28    6000
2  Ryan   32    7000
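
As a small sketch of the drop() call described above, the cell below removes several columns at once. The df_demo, trimmed, and safe names are illustrative assumptions; columns= is the keyword form of axis=1, and errors="ignore" skips labels that are not present.

```python
import pandas as pd

# Illustrative DataFrame (same shape as the example above)
df_demo = pd.DataFrame({
    "Name": ["John", "Emma", "Ryan"],
    "Age": [25, 28, 32],
    "City": ["New York", "Paris", "London"],
    "Salary": [5000, 6000, 7000],
})

# columns= is equivalent to axis=1 and accepts a list of column names
trimmed = df_demo.drop(columns=["City", "Salary"])

# errors="ignore" skips missing labels instead of raising KeyError
safe = df_demo.drop(columns=["Country"], errors="ignore")

print(trimmed)
print(safe.columns.tolist())
```

drop() returns a new DataFrame in both calls, so df_demo itself is left unchanged.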


### Calculations on Columns:
You can perform calculations on columns using arithmetic operators. For example, if you have two columns "column1" and "column2" in a DataFrame, you can create a new column "sum_column" that contains the sum of the two columns using the following syntax: df["sum_column"] = df["column1"] + df["column2"]. Similarly, you can use other arithmetic operators like subtraction (-), multiplication (*), and division (/) to perform calculations on columns.

In [24]:
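# Create a new column "Total Salary" as the sum of "Salary" and 1000.
# This cell relies on the df built in the previous cells; if df has no
# "Salary" column (for example after re-reading a file), it raises KeyError: 'Salary'.
df["Total Salary"] = df["Salary"] + 1000
print(df)

   Name  Age  Salary  Total Salary
0  John   25    5000          6000
1  Emma   28    6000          7000
2  Ryan   32    7000          8000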
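
The arithmetic described above can be sketched in a self-contained cell that does not depend on earlier state. The orders DataFrame and its price, quantity, total, and discounted columns are made-up names used only for illustration.

```python
import pandas as pd

# Hypothetical order data used only for this illustration
orders = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 3],
})

# Element-wise arithmetic between two columns
orders["total"] = orders["price"] * orders["quantity"]

# Arithmetic between a column and a scalar applies to every row
orders["discounted"] = orders["total"] - 5

print(orders)
```

Building the DataFrame inside the cell avoids a KeyError caused by running notebook cells out of order.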

#### Applying Functions to Columns
Pandas provides the apply() method to apply a function to each element of a column. You can define a custom function and use apply() to apply it to a column. For example, if you have a column "column_name" and you want to apply a function called my_function to each element of that column, you can use the following syntax: df["column_name"] = df["column_name"].apply(my_function).

In [57]:
# Define a function to convert age to days
def age_to_days(age):
    return age * 365

# Apply the function to the "Age" column
df["Age in Days"] = df["Age"].apply(age_to_days)
print(df)


   Name  Age  Salary  Total Salary  Age in Days
0  John   25    5000          6000         9125
1  Emma   28    6000          7000        10220
2  Ryan   32    7000          8000        11680


#### Dealing with rows
Dealing with rows in a Pandas DataFrame involves operations such as accessing rows, filtering rows based on conditions, adding new rows, and deleting rows.

#### Accessing Rows:
Pandas provides different methods to access rows in a DataFrame:

##### iloc: 
You can use the iloc indexer to access rows by their integer position. For example, to access the first row of a DataFrame, you can use df.iloc[0].

##### loc:
The loc indexer is used to access rows by their label. If your DataFrame has row labels, you can use the loc indexer to access rows by their labels. For example, if you have a DataFrame with row labels and you want to access the row with label "row_label", you can use df.loc["row_label"].

##### Boolean indexing:
You can use Boolean indexing to filter rows based on a condition: a Boolean expression on a column produces a True/False mask, and indexing the DataFrame with that mask keeps only the rows where it is True. For example, df[df["column_name"] > 10] will select rows where the value in "column_name" is greater than 10.

In [25]:
import pandas as pd

# Create a DataFrame with custom row labels
data = {
    "Name": ["John", "Emma", "Ryan"],
    "Age": [25, 28, 32],
    "City": ["New York", "Paris", "London"]}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

# iloc selects by integer position: df.iloc[0] is the first row, df.iloc[1] the second.
# loc selects by row label: df.loc['A'] is the row labeled 'A'.
# df.loc["Name"] would raise a KeyError because "Name" is a column, not a row label.
df.loc['A']



Name        John
Age           25
City    New York
Name: A, dtype: object
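
The three access styles described above can be compared side by side. This is a minimal sketch; the people DataFrame and the result variable names are assumptions chosen for the illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["John", "Emma", "Ryan"],
        "Age": [25, 28, 32],
        "City": ["New York", "Paris", "London"],
    },
    index=["A", "B", "C"],
)

first_by_position = people.iloc[0]          # first row, by integer position
row_b_by_label = people.loc["B"]            # row whose label is "B"
older_than_26 = people[people["Age"] > 26]  # rows where the condition is True

print(first_by_position["Name"])
print(row_b_by_label["Name"])
print(older_than_26)
```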

### Adding Rows:
To add a new row to a DataFrame, you can use the pd.concat() function or the loc indexer. The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.



In [26]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma'],
                   'Age': [25, 28]})

# Create a new row as a DataFrame
new_row = pd.DataFrame({'Name': ['Ryan'], 'Age': [32]})

# Concatenate the existing DataFrame with the new row
df = pd.concat([df, new_row], ignore_index=True)

print(df)
# ignore_index=True argument is provided to reset the index of the concatenated DataFrame.


   Name  Age
0  John   25
1  Emma   28
2  Ryan   32
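
The loc-based alternative mentioned above might look like the following sketch. It assumes the DataFrame still has its default 0..n-1 integer index, so that len(df_rows) is the next unused label; df_rows is an illustrative name.

```python
import pandas as pd

df_rows = pd.DataFrame({"Name": ["John", "Emma"], "Age": [25, 28]})

# Assigning to a label that does not exist yet adds a new row.
# With the default integer index, len(df_rows) is the next free label.
df_rows.loc[len(df_rows)] = ["Ryan", 32]

print(df_rows)
```

This modifies df_rows in place, whereas pd.concat() returns a new DataFrame. For adding many rows, collecting them first and calling pd.concat() once is usually the cheaper approach.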


#### Deleting Rows
To delete rows from a Pandas DataFrame, you can use the drop() method or Boolean indexing. Both return a new DataFrame and leave the original unchanged.

In [60]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Ryan'],
                   'Age': [25, 28, 32]})

# Method 1: Deleting rows using drop()
df_drop = df.drop(1)  # Delete the row with index 1
print("DataFrame after drop():")
print(df_drop)

# Method 2: Deleting rows using boolean indexing
df_bool = df[df['Name'] != 'Emma']  # Delete rows where Name is 'Emma'
print("\nDataFrame after boolean indexing:")
print(df_bool)


DataFrame after drop():
   Name  Age
0  John   25
2  Ryan   32

DataFrame after boolean indexing:
   Name  Age
0  John   25
2  Ryan   32


#### Pandas DataFrame.truncate

The truncate method in Pandas DataFrame is used to truncate a DataFrame or Series before and/or after a specified index or column label. It allows you to filter and keep only the data within a specific range.

DataFrame.truncate(before=None, after=None, axis=None, copy=True)

Parameters:

1. before: Truncate all rows before this index value; the row with this label is kept.
2. after: Truncate all rows after this index value; the row with this label is kept.
3. axis: The axis to truncate. The default, None, truncates along the index (rows); pass axis=1 or axis='columns' to truncate columns.
4. copy: Whether to return a copy of the truncated section. The default is True. truncate() returns a new object and does not modify the original DataFrame in place.

In [61]:
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

# Keep only rows whose index label lies between 2 and 5 (both ends inclusive).
# The index stops at 4, so rows 2, 3 and 4 are kept.
truncated_df = df.truncate(before=2, after=5)

print(truncated_df)


   A   B
2  3   8
3  4   9
4  5  10
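
The axis parameter described above also allows truncating by column label. The sketch below assumes a DataFrame named wide whose column labels are already in sorted order.

```python
import pandas as pd

wide = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
    "D": [10, 11, 12],
})

# Keep only the columns labeled "B" through "C" (both ends inclusive)
middle = wide.truncate(before="B", after="C", axis="columns")

print(middle)
```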


### Pandas Series.combine() 
The combine method in Pandas Series is used to combine two Series into a new Series by applying a given function element-wise. It allows you to perform custom operations on two Series and generate a new Series based on those operations.

Series.combine(other, func, fill_value=None)

Parameters:

1. other: The other Series to combine with.
2. func: A function that specifies how to combine the values from the two Series. This function takes two arguments, corresponding to the values from the two Series, and returns a single value.
3. fill_value: An optional value to assume when an index label is present in one Series but missing from the other. It does not replace NaN values that are already stored in either Series; those are passed to func unchanged.


In [39]:
import pandas as pd

# Create two sample Series
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30, 40, 50])

# Define a function to combine the values from the two Series
def custom_function(x, y):
    return x + y

# Combine the two Series using the custom function
combined_series = s1.combine(s2, custom_function)

print(combined_series)


0    11
1    22
2    33
3    44
4    55
dtype: int64


In [40]:
import pandas as pd
import numpy as np

# Create two sample Series with null values
s1 = pd.Series([1, 2, np.nan, 4, 5])
s2 = pd.Series([10, 20, 30, np.nan, 50])

# Define a function to combine the values from the two Series
def custom_function(x, y):
    return x + y

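# Combine the two Series using the custom function.
# fill_value=5 applies only to index labels missing from one Series; the NaN
# values already in the data reach custom_function unchanged, so those sums stay NaN.
combined_series = s2.combine(s1, custom_function, fill_value=5)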

print(combined_series)


0    11.0
1    22.0
2     NaN
3     NaN
4    55.0
dtype: float64
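
To see fill_value take effect, the two Series need different index labels. The sketch below assumes two small Series, s_left and s_right, with made-up bird labels and values.

```python
import pandas as pd

s_left = pd.Series({"falcon": 330.0, "eagle": 160.0})
s_right = pd.Series({"falcon": 345.0, "eagle": 200.0, "duck": 30.0})

# "duck" is missing from s_left, so fill_value=0 is passed to max() in its place
fastest = s_left.combine(s_right, max, fill_value=0)

print(fastest)
```

Without fill_value, the missing "duck" entry in s_left would be passed to the function as NaN instead of 0.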


In [46]:
import numpy as np
import pandas as pd

# Create two Series that each contain one NaN
first = pd.Series([1, 2, np.nan, 9])
second = pd.Series([5, 3, 3, np.nan])

# Keep x1 only when it is strictly greater than x2, otherwise keep x2.
# Any comparison with NaN is False, so x2 is chosen whenever either value is NaN.
# fill_value=1 is unused here because both Series share the same index.
result = first.combine(second, func=(lambda x1, x2: x1 if x1 > x2 else x2), fill_value=1)

# Display the combined Series
print(result)

0    5.0
1    3.0
2    3.0
3    NaN
dtype: float64


#### Pandas Index.append()
The Index.append() method in pandas is used to concatenate or append one index with another. It returns a new index object that includes all the elements from both indexes.

The syntax for Index.append() is as follows:


Index.append(other)



Here, Index is the initial index object and other is the index object (or list of index objects) that you want to append to it. The original index is not modified.

In [65]:
import pandas as pd

index1 = pd.Index(['A', 'B', 'C'])
index2 = pd.Index(['D', 'E', 'F'])

appended_index = index1.append(index2)
print(appended_index)


Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')


### The DataFrame.append()
The DataFrame.append() method appended the rows of one DataFrame to the end of another and returned a new DataFrame. It was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat() instead, as the example below does. The column labels of the two DataFrames should match so that the data aligns properly; columns present in only one DataFrame are filled with NaN.

The legacy signature and its pd.concat() replacement are:

###### DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

###### pd.concat(objs, axis=0, join='outer', ignore_index=False)


In the legacy call, DataFrame is the initial DataFrame and other is the DataFrame to append to it. pd.concat() instead takes a list of DataFrames as objs. The ignore_index, verify_integrity, and sort parameters are optional and control how the resulting index and columns are built.

In [70]:
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Alice'],
         'Age': [25, 30],
         'City': ['New York', 'London']})

df2 = pd.DataFrame({'Name': ['Bob'],
         'Age': [35],
         'City': ['Paris']})


print( pd.concat([df1, df2]))


    Name  Age      City
0   John   25  New York
1  Alice   30    London
0    Bob   35     Paris
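
The output above repeats the row label 0 because each DataFrame keeps its own index. A short sketch of two ways to deal with that, using illustrative names df_a and df_b: ignore_index=True renumbers the rows, while verify_integrity=True raises an error when labels collide.

```python
import pandas as pd

df_a = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30]})
df_b = pd.DataFrame({"Name": ["Bob"], "Age": [35]})

# Renumber the rows 0..n-1 instead of keeping the original labels
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)

# verify_integrity=True refuses to build an index with duplicate labels
duplicate_detected = False
try:
    pd.concat([df_a, df_b], verify_integrity=True)
except ValueError as exc:
    duplicate_detected = True
    print("Duplicate labels:", exc)
```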


### concat by using axis

To concatenate DataFrames using the pd.concat() function along a specific axis, you can specify the axis parameter. The axis parameter determines the axis along which the concatenation should occur. By default, axis is set to 0, which means concatenation along the rows. If you want to concatenate along the columns, you can set axis to 1.

In [27]:
# on axis =1
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3],
         'B': [4, 5, 6]})

df2 = pd.DataFrame ({'C': [7, 8, 9],
         'D': [10, 11, 12]})


print(pd.concat([df1, df2], axis=1))


   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


In [72]:
# on axis =0
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3],
         'B': [4, 5, 6]})

df2 = pd.DataFrame ({'C': [7, 8, 9],
         'D': [10, 11, 12]})


print(pd.concat([df1, df2], axis=0))

     A    B    C     D
0  1.0  4.0  NaN   NaN
1  2.0  5.0  NaN   NaN
2  3.0  6.0  NaN   NaN
0  NaN  NaN  7.0  10.0
1  NaN  NaN  8.0  11.0
2  NaN  NaN  9.0  12.0


### concat by using join

To concatenate DataFrames using the pd.concat() function with a specific join method, you can use the join parameter. The join parameter determines how the labels on the other axis (the columns, when concatenating rows) are handled during concatenation.

The join parameter in pd.concat() can take the following values:

1. 'outer': Performs a union of the indexes/columns. This is the default behavior.
2. 'inner': Performs an intersection of the indexes/columns.

In the example below the two DataFrames share no column names, so the inner join returns an empty DataFrame.

In [2]:
import pandas as pd


df1 = pd.DataFrame({'A': [1, 2, 3],
         'B': [4, 5, 6]})

df2 = pd.DataFrame ({'C': [7, 8, 9],
         'D': [10, 11, 12]})

print("Concatenation with 'outer' join:")
print(pd.concat([df1, df2], join='outer'))
print("\nConcatenation with 'inner' join:")
print(pd.concat([df1, df2], join='inner'))


Concatenation with 'outer' join:
     A    B    C     D
0  1.0  4.0  NaN   NaN
1  2.0  5.0  NaN   NaN
2  3.0  6.0  NaN   NaN
0  NaN  NaN  7.0  10.0
1  NaN  NaN  8.0  11.0
2  NaN  NaN  9.0  12.0

Concatenation with 'inner' join:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 0, 1, 2]
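
The difference between the two join modes is easier to see when the DataFrames share at least one column. In this sketch the names left and right are assumptions, and "B" is the only common column.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Union of the columns: A, B, C (missing cells become NaN)
outer = pd.concat([left, right], join="outer", ignore_index=True)

# Intersection of the columns: only B survives
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(outer)
print(inner)
```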


### Grouping Rows in pandas

You can group rows together based on a specific column or a set of columns using the groupby() function. Grouping allows you to aggregate data, apply calculations within each group, or perform other operations on grouped data.

The basic syntax for grouping rows in pandas is as follows:

grouped = dataframe.groupby(by)


In [74]:
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Alice'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'London', 'Paris', 'London']}

df = pd.DataFrame(data)

# Grouping by 'City'
grouped = df.groupby('City')

# Calculating the mean age for each city
mean_age = grouped['Age'].mean()

print(mean_age)


City
London      29.0
New York    25.0
Paris       35.0
Name: Age, dtype: float64


You can perform various operations on grouped data, such as aggregating with functions like sum(), count(), min(), and max(), or applying custom functions with apply(). You can also group by multiple columns by passing a list of column names to groupby().
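
A brief sketch of several aggregations at once and of grouping by more than one column. The staff DataFrame and its Team column are made up for this illustration.

```python
import pandas as pd

staff = pd.DataFrame({
    "City": ["London", "London", "Paris", "Paris", "Paris"],
    "Team": ["Sales", "Sales", "Sales", "Tech", "Tech"],
    "Age": [30, 28, 35, 40, 22],
})

# Several aggregations at once on a single column
age_stats = staff.groupby("City")["Age"].agg(["count", "min", "max"])

# Grouping by two columns gives a result indexed by (City, Team) pairs
mean_by_city_team = staff.groupby(["City", "Team"])["Age"].mean()

print(age_stats)
print(mean_by_city_team)
```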

### Series.str.join()
In pandas, the str.join() method is used to concatenate string or list elements with a passed delimiter. This method is available for string-type columns or columns containing lists of strings.

The basic syntax for str.join() is as follows:

series.str.join(sep)


In [75]:
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Fruit': [['Apple', 'Banana'], ['Orange'], ['Grapes', 'Mango']]}

df = pd.DataFrame(data)

# Joining elements in the 'Fruit' column with an underscore as delimiter
df['Joined_Fruit'] = df['Fruit'].str.join('_')

print(df)


    Name            Fruit  Joined_Fruit
0   John  [Apple, Banana]  Apple_Banana
1  Alice         [Orange]        Orange
2    Bob  [Grapes, Mango]  Grapes_Mango


In [76]:
import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob'],
        'Fruit': [['Apple', 'Banana'], ['Orange'], ['Grapes', 'Mango']]}

df = pd.DataFrame(data)

# Joining elements in the 'Fruit' column with '+++' as delimiter
df['Joined_Fruit'] = df['Fruit'].str.join('+++')

print(df)


    Name            Fruit    Joined_Fruit
0   John  [Apple, Banana]  Apple+++Banana
1  Alice         [Orange]          Orange
2    Bob  [Grapes, Mango]  Grapes+++Mango


### str.cat()

To join two text columns into a single column in pandas, you can use the + operator or the str.cat() method. Both methods allow you to concatenate the values from two columns into a new column.

In [80]:
#Using the + operator:

import pandas as pd

data = {'First Name': ['John', 'Alice', 'Bob'],
        'Last Name': ['Doe', 'Smith', 'Johnson']}

df = pd.DataFrame(data)

# Joining 'First Name' and 'Last Name' columns using the + operator
df['Full Name'] = df['First Name'] + ' ' + df['Last Name']

print(df)


  First Name Last Name    Full Name
0       John       Doe     John Doe
1      Alice     Smith  Alice Smith
2        Bob   Johnson  Bob Johnson


In [81]:
# Using the str.cat() method:

import pandas as pd

data = {'First Name': ['John', 'Alice', 'Bob'],
        'Last Name': ['Doe', 'Smith', 'Johnson']}

df = pd.DataFrame(data)

# Joining 'First Name' and 'Last Name' columns using str.cat()
df['Full Name'] = df['First Name'].str.cat(df['Last Name'], sep=' ')

print(df)


  First Name Last Name    Full Name
0       John       Doe     John Doe
1      Alice     Smith  Alice Smith
2        Bob   Johnson  Bob Johnson


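### The lambda function
You can also use a lambda function with the apply() method to join two text columns into a single column in pandas. With axis=1, apply() passes each row to the lambda, which lets you define a custom transformation per row. For plain concatenation, the + operator and str.cat() shown above are usually faster because they operate on whole columns.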

In [82]:
import pandas as pd

data = {'First Name': ['John', 'Alice', 'Bob'],
        'Last Name': ['Doe', 'Smith', 'Johnson']}

df = pd.DataFrame(data)

# Joining 'First Name' and 'Last Name' columns using a lambda function
df['Full Name'] = df.apply(lambda row: row['First Name'] + ' ' + row['Last Name'], axis=1)

print(df)


  First Name Last Name    Full Name
0       John       Doe     John Doe
1      Alice     Smith  Alice Smith
2        Bob   Johnson  Bob Johnson


### Date and time

Pandas provides robust functionality for working with date and time data through its Timestamp and DatetimeIndex objects. This allows you to handle various operations on dates and times, including parsing, formatting, indexing, arithmetic, and more.


##### Parsing and Creating Dates: 
You can parse strings representing dates or times using the pd.to_datetime() function. It converts a wide range of date and time formats into Timestamp objects.



In [84]:
import pandas as pd

date_string = '2023-05-24'
date = pd.to_datetime(date_string)
print(date)


2023-05-24 00:00:00


###### DatetimeIndex:
Pandas has a specialized index object called DatetimeIndex, which allows you to index and manipulate data based on dates and times. You can create a DatetimeIndex from a list of dates or use it directly as the index of a DataFrame.

In [85]:
import pandas as pd

dates = ['2023-05-23', '2023-05-24', '2023-05-25']
values = [10, 20, 30]
df = pd.DataFrame({'Value': values}, index=pd.to_datetime(dates))
print(df)


            Value
2023-05-23     10
2023-05-24     20
2023-05-25     30


###### Date Range Generation:
Pandas provides the pd.date_range() function to generate a range of dates based on specified frequency.

In [3]:
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
print(dates)
# freq='D': This parameter determines the frequency or interval between the dates. In this case,
#     it is set to 'D', which stands for 'day'. This means that the list will include all dates from the start date
#     to the end date, with each date separated by one day

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10',
               ...
               '2023-12-22', '2023-12-23', '2023-12-24', '2023-12-25',
               '2023-12-26', '2023-12-27', '2023-12-28', '2023-12-29',
               '2023-12-30', '2023-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
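
Instead of an end date, pd.date_range() can also take a number of periods, and freq accepts aliases other than 'D'. The sketch below shows two common ones; the variable names are illustrative.

```python
import pandas as pd

# Five consecutive business days starting on Monday 2023-01-02
business_days = pd.date_range(start="2023-01-02", periods=5, freq="B")

# Four weekly dates; the "W" alias anchors on Sundays by default
weekly = pd.date_range(start="2023-01-01", periods=4, freq="W")

print(business_days)
print(weekly)
```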


###### Resampling and Shifting:
Pandas provides functions for resampling time series data, such as upsampling (increasing frequency) and downsampling (decreasing frequency), and shifting dates and times.

In [8]:
# resampling
import pandas as pd

# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
values = range(len(dates))
df = pd.DataFrame({'Value': values}, index=dates)

# Resample the data to a monthly frequency and calculate the mean.
# 'M' is the month-end alias; pandas 2.2 and later prefer 'ME'.
monthly_mean = df['Value'].resample('M').mean()
print(monthly_mean)


2023-01-31     15.0
2023-02-28     44.5
2023-03-31     74.0
2023-04-30    104.5
2023-05-31    135.0
2023-06-30    165.5
2023-07-31    196.0
2023-08-31    227.0
2023-09-30    257.5
2023-10-31    288.0
2023-11-30    318.5
2023-12-31    349.0
Freq: M, Name: Value, dtype: float64


In [6]:
# shifting
import pandas as pd

# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
values = [10, 20, 30, 40, 50]
df = pd.DataFrame({'Value': values}, index=dates)

# Shift the values one period forward
shifted = df['Value'].shift(1)
print(shifted)


2023-01-01     NaN
2023-01-02    10.0
2023-01-03    20.0
2023-01-04    30.0
2023-01-05    40.0
Freq: D, Name: Value, dtype: float64
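
The examples above cover downsampling and a forward shift. A minimal sketch of the other two cases mentioned, upsampling and a backward shift, follows. The daily Series is made up, and the "12h" alias spelling is an assumption about the installed pandas version (older releases also accept "12H").

```python
import pandas as pd

daily = pd.Series(
    [10, 20, 30],
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)

# Upsample from daily to 12-hour steps; ffill() carries the last known value forward
half_daily = daily.resample("12h").ffill()

# A negative shift moves values backward in time, leaving NaN at the end
shifted_back = daily.shift(-1)

print(half_daily)
print(shifted_back)
```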


###### Time-based Indexing and Slicing
Time-based indexing and slicing let you select data by a specific date or by a date range when the DataFrame has a DatetimeIndex.

In [9]:
#Indexing with Dates
import pandas as pd

# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
values = [10, 20, 30, 40, 50]
df = pd.DataFrame({'Value': values}, index=dates)

# Index the DataFrame using a specific date
specific_date = pd.to_datetime('2023-01-04')
specific_row = df.loc[specific_date]
print(specific_row)


Value    40
Name: 2023-01-04 00:00:00, dtype: int64


In [91]:
# Slicing Date Ranges:
import pandas as pd

# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
values = range(10)
df = pd.DataFrame({'Value': values}, index=dates)

# Slice the DataFrame for a specific date range
date_range = pd.date_range(start='2023-01-03', end='2023-01-06')
sliced_data = df.loc[date_range]
print(sliced_data)


            Value
2023-01-03      2
2023-01-04      3
2023-01-05      4
2023-01-06      5
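
A date range can also be selected with a label slice of date strings, which avoids building a separate DatetimeIndex. This is a sketch with an illustrative ts_df; note that label slices include both endpoints.

```python
import pandas as pd

ts_df = pd.DataFrame(
    {"Value": range(10)},
    index=pd.date_range("2023-01-01", periods=10, freq="D"),
)

# Label slices on a DatetimeIndex include both the start and the end date
window = ts_df.loc["2023-01-03":"2023-01-06"]

# Partial string indexing: every row that falls in January 2023
january = ts_df.loc["2023-01"]

print(window)
print(len(january))
```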


##### pandas timestamp now
To get the current timestamp in Pandas, you can use the pd.Timestamp.now() function. It returns the current date and time as a Pandas Timestamp object.

In [10]:
import pandas as pd

current_time = pd.Timestamp.now()
print(current_time)

2023-06-13 14:41:45.424220


##### pandas timestamp isoformat
The Timestamp.isoformat() method is used to return the string representation of a timestamp in ISO 8601 format. It converts a Pandas Timestamp object into a string that represents the timestamp in a standardized format.

In [94]:
ts=pd.Timestamp(year=2001,month=3,day=13,hour=4,second=30,tz='us/eastern')
print(ts.isoformat())

2001-03-13T04:00:30-05:00


###### Pandas Timestamp.date
The Timestamp.date() method extracts the date component from a Pandas Timestamp object. It returns a Python datetime.date object representing the date portion of the timestamp, without the time or time zone.

In [98]:
ts=pd.Timestamp(year=2001,month=3,day=13,hour=4,second=30,tz='us/eastern')
print(ts.date())

2001-03-13


###### Timestamp.replace() 
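The Timestamp.replace() method creates a new Timestamp object with specific components replaced. It allows you to change individual components of a timestamp, such as the year, month, day, hour, minute, second, microsecond, or time zone; the original Timestamp is not modified. In the example below the UTC offset changes from -05:00 to -04:00 because daylight saving time was already in effect on 13 March 2023 but not on 13 March 2001.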

In [99]:
ts=pd.Timestamp(year=2001,month=3,day=13,hour=4,second=30,tz='us/eastern')
print(ts.replace(year=2023))

2023-03-13 04:00:30-04:00


###### pd.to_datetime()
The pd.to_datetime() function converts various types of input into pandas datetime objects. A single string returns a Timestamp, a list or array returns a DatetimeIndex, and a Series returns a Series of datetime64 values.

In [100]:
import pandas as pd

date_list = ['2023-05-24', '2023-05-25', '2023-05-26']
timestamps = pd.to_datetime(date_list)
print(timestamps)


DatetimeIndex(['2023-05-24', '2023-05-25', '2023-05-26'], dtype='datetime64[ns]', freq=None)
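
Two optional arguments are often useful when the input is less regular than the list above. The following is a sketch with made-up input strings: format states the expected layout explicitly, and errors="coerce" turns values that cannot be parsed into NaT instead of raising.

```python
import pandas as pd

# An explicit format avoids ambiguity between day-first and month-first dates
day_first = pd.to_datetime("24/05/2023", format="%d/%m/%Y")

# errors="coerce" replaces unparseable entries with NaT instead of raising an error
mixed = pd.to_datetime(["2023-05-24", "not a date"], errors="coerce")

print(day_first)
print(mixed)
```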


### Series.str.lower(), Series.str.upper(), and Series.str.title()
These methods transform the elements of a Series to lowercase, uppercase, and title case, respectively. They are particularly useful for normalizing string data in a Series.

In [101]:
# Series.str.lower()
import pandas as pd

data = ['Hello', 'WORLD', 'pandas']
series = pd.Series(data)
lower_series = series.str.lower()
print(lower_series)


0     hello
1     world
2    pandas
dtype: object


In [102]:
# Series.str.upper()
import pandas as pd

data = ['Hello', 'WORLD', 'pandas']
series = pd.Series(data)
upper_series = series.str.upper()
print(upper_series)


0     HELLO
1     WORLD
2    PANDAS
dtype: object


In [103]:
#Series.str.title()
import pandas as pd

data = ['hello world', 'pandas is awesome']
series = pd.Series(data)
title_series = series.str.title()
print(title_series)


0          Hello World
1    Pandas Is Awesome
dtype: object


### Series.str.replace()
The Series.str.replace() method replaces each occurrence of a substring or pattern in the elements of a Series with a replacement string. In pandas 2.0 and later the pattern is treated as a literal string by default; pass regex=True to treat it as a regular expression.

In [104]:
import pandas as pd

data = ['apple', 'banana', 'cherry']
series = pd.Series(data)
modified_series = series.str.replace('a', 'X')
print(modified_series)

0     Xpple
1    bXnXnX
2    cherry
dtype: object
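
The difference between literal and regular-expression replacement can be sketched as follows. The fruits Series mirrors the example above; the second Series is made up to show a character ("." ) that has a special meaning in regular expressions.

```python
import pandas as pd

fruits = pd.Series(["apple", "banana", "cherry"])

# regex=True: the pattern is a regular expression matching any lowercase vowel
no_vowels = fruits.str.replace(r"[aeiou]", "", regex=True)

# regex=False: the pattern is matched literally, so "." means a real dot
literal = pd.Series(["a.b", "a-b"]).str.replace(".", "_", regex=False)

print(no_vowels)
print(literal)
```

Passing regex explicitly keeps the behavior the same across pandas versions, since the default changed in pandas 2.0.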


### Series.str.strip(), lstrip() and rstrip()

The Series.str.strip(), Series.str.lstrip(), and Series.str.rstrip() methods remove whitespace from both ends, the left end, and the right end of each element of a Series, respectively. They are useful for cleaning and preprocessing string data by eliminating unwanted whitespace.

##### Series.str.strip():
This method removes leading and trailing whitespace characters from the elements of a Series.

In [105]:
import pandas as pd

data = ['  apple  ', '  banana', 'cherry  ']
series = pd.Series(data)
stripped_series = series.str.strip()
print(stripped_series)


0     apple
1    banana
2    cherry
dtype: object


##### Series.str.lstrip(): 
This method removes leading (left) whitespace characters from the elements of a Series.

In [106]:
import pandas as pd

data = ['  apple  ', '  banana', 'cherry  ']
series = pd.Series(data)
lstripped_series = series.str.lstrip()
print(lstripped_series)


0     apple  
1      banana
2    cherry  
dtype: object


##### Series.str.rstrip():
This method removes trailing (right) whitespace characters from the elements of a Series.

In [107]:
import pandas as pd

data = ['  apple  ', '  banana', 'cherry  ']
series = pd.Series(data)
rstripped_series = series.str.rstrip()
print(rstripped_series)


0       apple
1      banana
2      cherry
dtype: object
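
All three methods also accept an optional string of characters to remove instead of whitespace. A short sketch with made-up input:

```python
import pandas as pd

raw = pd.Series(["**apple**", "--banana", "cherry!!"])

# Remove any of the characters *, - and ! from both ends of each element
cleaned = raw.str.strip("*-!")

print(cleaned)
```

The argument is treated as a set of individual characters, not as a prefix or suffix, so their order in the string does not matter.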
