
**Day_1-23Nov**

Create new data:

1. DataFrame: A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

2. Series: A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

3. Edge Cases & Pitfalls
 - A Series index can contain duplicate or missing (NaN) labels, which affects lookups and any operation that aligns on the index.

 - DataFrames can have columns of different types, so beware of operations that assume types are uniform.

 - When working with Series or DataFrames from dicts, missing values may appear (represented by NaN) if keys or indices don’t align.
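
The pitfalls above can be reproduced with a few lines. This is a minimal sketch using made-up data; the variable names (`dup`, `mixed`, `rows`) are illustrative only.

```python
import pandas as pd

# Non-unique index: a label lookup returns every matching entry, not a scalar.
dup = pd.Series([10, 20, 30], index=["a", "a", "b"])
print(dup.loc["a"])          # a Series with two rows
print(dup.index.is_unique)   # False

# Columns of different types: each column keeps its own dtype.
mixed = pd.DataFrame({"name": ["A", "B"], "age": [20, 25]})
print(mixed.dtypes)

# Misaligned keys: the second dict has no "y", so that cell becomes NaN
# and the whole "y" column is stored as float.
rows = pd.DataFrame([{"x": 1, "y": 2}, {"x": 3}])
print(rows)
print(rows["y"].isna().sum())  # number of missing values in column "y"
```

Checking `index.is_unique` and `isna().sum()` early is one simple way to notice these cases before they affect later calculations.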

In [5]:
import pandas as pd

df = pd.DataFrame({"yes":[0,1],"no":[1,0]},index=["product A","product B"] )
print(df)

           yes  no
product A    0   1
product B    1   0


1. From a Python List:

- To make a Series: use pd.Series([elements]).

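- To make a DataFrame: use pd.DataFrame([[row1], [row2]]) for several rows or, for a single column, pd.DataFrame(list). Note that wrapping a flat list in another list, pd.DataFrame([list]), produces a single row instead.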

2. From a Dict:

- Series: pd.Series({'a': 1, 'b': 2}) (dict keys become index).

- DataFrame: pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}) (dict keys become columns).

3. From a NumPy Array:

- Series: pd.Series(numpy_array).

- DataFrame: pd.DataFrame(numpy_2d_array).
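
How pandas interprets a flat list versus a nested list is easiest to see by comparing shapes. The sketch below uses illustrative names (`values`, `one_column`, `one_row`, `from_array`) that are not part of the notebook.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

one_column = pd.DataFrame(values)    # flat list: each element becomes a row
one_row = pd.DataFrame([values])     # nested list: the inner list is one row
print(one_column.shape)
print(one_row.shape)

# A 2-D NumPy array maps array rows to DataFrame rows.
arr = np.array([[1, 2, 3], [4, 5, 6]])
from_array = pd.DataFrame(arr, columns=["X", "Y", "Z"])
print(from_array.shape)
```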

In [14]:
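import numpy as np
import pandas as pd

# Series from a list, with explicit index labels
series_list = pd.Series([1,2,3,4], index=["A",'B',"C",'D'])
print(series_list)

# Series from a dict (keys become the index)
series_dict = pd.Series({'a': 1, 'b': 2})
print(series_dict)

# Series from a NumPy array
series_arr = pd.Series(np.array([1,2,3,4]), index=["A",'B',"C",'D'])
print(series_arr)

# DataFrame from a list of rows
df_list = pd.DataFrame([[1,2,3],[4,5,6]], index=['A','B'], columns=['X','Y','Z'])
print(df_list)

# DataFrame from a dict (keys become the columns)
df_dict = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_dict)

# DataFrame from a 2-D NumPy array
df_arr = pd.DataFrame(np.array([[1,2,3],[4,5,6]]), index=['A','B'], columns=['X','Y','Z'])
print(df_arr)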

A    1
B    2
C    3
D    4
dtype: int64
a    1
b    2
dtype: int64
A    1
B    2
C    3
D    4
dtype: int64
   X  Y  Z
A  1  2  3
B  4  5  6
   col1  col2
0     1     3
1     2     4
   X  Y  Z
A  1  2  3
B  4  5  6


In [23]:
import pandas as pd

df = pd.DataFrame({'col1': [1., 2, 1, 2, 1, 2, 1, 2, 1, 2], 'col2': [3, 4, 3, 4, 3, 4, 3, 4, 3, 4], 'col3': [5, 6, 5, 6, 5, 6, 5, 6, 5, 6]}, index = ['a','b','c','d','e','f','g','h','i','j'])
print(df)

print(f"head: {df.head()} \n")
print(f"tail: {df.tail()} \n")
print(f"describe: {df.describe()} \n")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
print(f"shape: {df.shape} \n")
print(f"columns: {df.columns} \n")
print(f"index: {df.index} \n")
print(f"dtypes: {df.dtypes} \n")

   col1  col2  col3
a   1.0     3     5
b   2.0     4     6
c   1.0     3     5
d   2.0     4     6
e   1.0     3     5
f   2.0     4     6
g   1.0     3     5
h   2.0     4     6
i   1.0     3     5
j   2.0     4     6
head:    col1  col2  col3
a   1.0     3     5
b   2.0     4     6
c   1.0     3     5
d   2.0     4     6
e   1.0     3     5 

tail:    col1  col2  col3
f   2.0     4     6
g   1.0     3     5
h   2.0     4     6
i   1.0     3     5
j   2.0     4     6 

describe:             col1       col2       col3
count  10.000000  10.000000  10.000000
mean    1.500000   3.500000   5.500000
std     0.527046   0.527046   0.527046
min     1.000000   3.000000   5.000000
25%     1.000000   3.000000   5.000000
50%     1.500000   3.500000   5.500000
75%     2.000000   4.000000   6.000000
max     2.000000   4.000000   6.000000 

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  -------------- 
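
Unlike `head()` or `describe()`, `df.info()` writes its summary straight to standard output and returns `None`, so embedding it in an f-string prints `None`. If the summary is needed as a string, one option is to pass a text buffer through the `buf` parameter. A minimal sketch, assuming a small stand-in DataFrame rather than the one above:

```python
import io
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3, 4]}, index=["a", "b"])

result = df.info()    # prints the summary itself
print(result)         # None

# Capture the same summary as text instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(f"info:\n{summary}")
```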

1. What is .loc?

- .loc = Label-based selection
- You select rows/columns using their names (labels).

- Used for:
  - Row labels (e.g., "A", "row1", or 5 if the index contains the label 5)
  - Column names ("age", "salary")
  - Boolean conditions

- Syntax: df.loc[row_label, column_label]

2. What is .iloc?

- .iloc = Integer-location based selection

- You select by position number (like Python lists).

- Used for:
  - Row position: 0, 1, 2, 3…
  - Column position: 0, 1, 2…

- Syntax: df.iloc[row_position, column_position]

---
When to use which?
1. Use .loc when:

- You know the row names or column names

- Dataset has meaningful index (dates, IDs)

- You use conditions

  - Example: df.loc[df["age"] > 20]

2. Use .iloc when:

- You want to select by position, regardless of what the labels are

- You want selection like a list (df.iloc[0:5])

- You don't know the column names
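
The difference matters most when the index itself holds integers. The sketch below assumes a hypothetical DataFrame whose index contains the IDs 10, 20, 30, so labels and positions do not coincide.

```python
import pandas as pd

# Hypothetical data: integer IDs as the index, not 0, 1, 2.
df = pd.DataFrame({"age": [18, 25, 30]}, index=[10, 20, 30])

by_label = df.loc[10, "age"]     # row whose label is 10
by_position = df.iloc[0, 0]      # first row, first column
print(by_label, by_position)     # both refer to the same cell

# .loc[0] looks for the label 0, which does not exist here.
try:
    df.loc[0]
except KeyError:
    print("no row is labelled 0")

adults = df.loc[df["age"] > 20]  # condition-based selection
print(adults)

first_two = df.iloc[0:2]         # first two rows by position
print(first_two)
```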

---

| Feature      | `.loc`                  | `.iloc`                                    |
| ------------ | ----------------------- | ------------------------------------------ |
| Select by    | Label (name)            | Integer position                           |
| Rows         | row labels              | row numbers                                |
| Columns      | column names            | column numbers                             |
| Slice end    | inclusive               | exclusive (like Python)                    |
| Boolean mask | boolean Series or array | boolean array or list only (not a Series)  |
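
Two rows of this comparison are easy to check directly: the slice end and boolean masks. The following is a small sketch on made-up data. Note that `.iloc` does accept a plain boolean array, while a boolean Series (which carries an index) is rejected, so the mask is converted with `to_numpy()` first.

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 25, 30]}, index=["x", "y", "z"])

label_slice = df.loc["x":"y"]      # end label "y" is included
position_slice = df.iloc[0:1]      # end position 1 is excluded
print(len(label_slice), len(position_slice))

mask = df["age"] > 20              # boolean Series aligned on the index
by_loc = df.loc[mask]
print(by_loc)

# .iloc works with the underlying boolean array rather than the Series.
by_iloc = df.iloc[mask.to_numpy()]
print(by_iloc)
```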


In [43]:
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [20, 25, 30],
    "salary": [50, 60, 70]
}, index=["x", "y", "z"])

print(df)
print(f"{df['name']}, {type(df['name'])} \n")
print(f"{df[['name', 'age']]}, {type(df[['name', 'age']])} \n")

# access using .loc (label-based); single quotes inside the f-string keep this valid before Python 3.12
print(f"{df.loc['x']} \n")
print(f"{df.loc[['x', 'y']]} \n")
print(f"{df.loc[['x', 'z'], ['name', 'salary']]} \n")

print(df.iloc[1]) # row wise selection
print(f"\n {df.iloc[2, 1]} \n")
print(f"{df.iloc[0:2, 0:2]} \n") # row-wise , col-wise

# slicing: .loc includes the end label, .iloc excludes the end position
print(f"loc_slice: {df.loc['x':'z']} \n")
print(f"iloc_slice: {df.iloc[0:2]} \n")

  name  age  salary
x    A   20      50
y    B   25      60
z    C   30      70
x    A
y    B
z    C
Name: name, dtype: object, <class 'pandas.core.series.Series'> 

  name  age
x    A   20
y    B   25
z    C   30, <class 'pandas.core.frame.DataFrame'> 

name       A
age       20
salary    50
Name: x, dtype: object 

  name  age  salary
x    A   20      50
y    B   25      60 

  name  salary
x    A      50
z    C      70 

name       B
age       25
salary    60
Name: y, dtype: object

 30 

  name  age
x    A   20
y    B   25 

loc_slice:   name  age  salary
x    A   20      50
y    B   25      60
z    C   30      70 

iloc_slice:   name  age  salary
x    A   20      50
y    B   25      60 

