# Introduction To Pandas 🐼

- Pandas is a popular Python library used for data manipulation and analysis. It makes working with structured data (like tables, CSVs, Excel files, SQL data) very easy.

- It builds on the speed and power of NumPy to make data analysis and preprocessing easy.

## 1. Data Structures

- Series: 1D labeled array (like a single column).

- DataFrame: 2D labeled table (rows × columns, like a spreadsheet).

# 📊 CREATING DATA

## What Is A DataFrame?

- A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column. 

In [None]:
#dataframes can have integers 

import pandas as pd

pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})


#dataframes can also have strings 

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})


#replacing the default index (0, 1, 2, ...) with our own row labels by using the index parameter

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']}, 
              index=['Product A', 'Product B'])

Unnamed: 0,Bob,Sue
Product A,I liked it.,Pretty good.
Product B,It was awful.,Bland.


## What Is A Series?

- A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. 

In [5]:
#creating a simple series with a list

pd.Series([1,2,3,4,5])


#A Series is, in essence, a single column of a DataFrame. 
# So you can assign row labels to the Series the same way as before, using an index parameter. 
# However, a Series does not have a column name, it only has one overall name:

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

# 📖 READING DATA


## Reading data files:

- Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

- Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file.
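As a minimal sketch of reading CSV data, the snippet below parses a small piece of CSV text held in memory. The column names and values are made up for illustration, and `io.StringIO` stands in for a real file so the example runs without anything on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """name,marks,city
sidra,92,amsterdam
harry,43,london
"""

# pd.read_csv accepts a file path or any file-like object.
students = pd.read_csv(io.StringIO(csv_text))

print(students.shape)  # (number of rows, number of columns)
print(students.head())
```

With a real file, the call would simply take the file path instead of the `StringIO` object.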

## 2. Data Operations

- Reading/writing data: CSV, Excel, SQL, JSON.

- Selecting, filtering, and slicing data.

- Handling missing data (NaN) with methods like dropna() and fillna().

- Aggregations: sum, mean, count, groupby operations.

- Sorting, merging, joining, and reshaping datasets.
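The missing-data methods named in the list can be sketched on a tiny, made-up DataFrame. The names `people`, `dropped` and `filled` are illustrative only, and filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
people = pd.DataFrame({
    "name": ["sidra", "dan", "peter"],
    "age": [24.0, np.nan, 32.0],
})

# dropna() returns a new DataFrame without the rows that contain NaN.
dropped = people.dropna()

# fillna() returns a new DataFrame with NaN replaced; here the missing
# age is replaced by the mean of the known ages.
filled = people.fillna({"age": people["age"].mean()})

print(dropped)
print(filled)
```

Neither call changes `people` itself; both return new objects.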

## Summary Functions

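- .describe(): This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. For numerical data it reports the count, mean, standard deviation, minimum, quartiles and maximum; for string data it reports the count, the number of unique values, the most frequent value and its frequency.

- .mean(): To see the mean of a numerical column, such as the points allotted (e.g. how well an averagely rated wine does), we can use the mean() function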

- .unique(): To see a list of unique values

- .value_counts(): To see a list of unique values and how often they occur in the dataset
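A short sketch of these four summary functions, assuming a small stand-in for the wine reviews data (the `sample_reviews` name and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataset.
sample_reviews = pd.DataFrame({
    "country": ["US", "US", "FR", "IT"],
    "points": [90, 92, 88, 86],
})

numeric_summary = sample_reviews["points"].describe()   # count, mean, std, min, quartiles, max
text_summary = sample_reviews["country"].describe()     # count, unique, top, freq
mean_points = sample_reviews["points"].mean()           # average score
countries = sample_reviews["country"].unique()          # distinct values, in order of appearance
country_counts = sample_reviews["country"].value_counts()  # distinct values with their frequencies

print(numeric_summary)
print(text_summary)
print(mean_points)
print(countries)
print(country_counts)
```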



## Maps

- A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!



### Common Mapping Methods

#### map():

- map() is the first, and slightly simpler, of the two. For example, suppose that we wanted to re-center the scores the wines received so that their mean is 0.

- The function you pass to map() should expect a single value from the Series (a point value, in the above example), and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.
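The re-centering described above could look roughly like this. The `sample_reviews` DataFrame is a made-up stand-in for the reviews data, so treat it as a sketch rather than the original notebook's code.

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataset.
sample_reviews = pd.DataFrame({"points": [90, 92, 88]})

points_mean = sample_reviews["points"].mean()

# The lambda receives one value at a time and returns its transformed version.
centered = sample_reviews["points"].map(lambda p: p - points_mean)

print(centered)
print(sample_reviews["points"])  # the original column is unchanged
```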


#### apply():

- apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row (axis='columns') or on each column (the default).
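A row-wise apply() can be sketched as follows. The `sample_reviews` data and the `recenter_points` helper are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataset.
sample_reviews = pd.DataFrame({
    "country": ["US", "FR"],
    "points": [90, 88],
})

points_mean = sample_reviews["points"].mean()

def recenter_points(row):
    # Each row arrives as a Series; work on a copy and return the new row.
    row = row.copy()
    row["points"] = row["points"] - points_mean
    return row

# axis="columns" passes one row at a time to the function.
centered_reviews = sample_reviews.apply(recenter_points, axis="columns")

print(centered_reviews)
print(sample_reviews)  # the original DataFrame keeps its points
```

For a simple arithmetic change like this one, a direct column operation such as `sample_reviews["points"] - points_mean` is usually faster; apply() earns its place when the per-row logic is more involved.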



### Key Takeaways:

- Note that map() and apply() return new, transformed Series and DataFrames, respectively. They don't modify the original data they're called on. If we look at the first row of reviews, we can see that it still has its original points value.

## GroupWise Data Analysis

- Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column. However, often we want to group our data, and then do something specific to the group the data is in.

-  We do this with the groupby() operation.


In [None]:
#One function we've been using heavily thus far is the value_counts() function.
#We can replicate what value_counts() does by doing the following:


import pandas as pd

reviews = pd.read_csv('', index_col=0)
pd.set_option("display.max_rows", 5)
reviews.groupby('points').points.count()


### Types Of Groupby() Methods:

- apply(): apply() lets you run a custom function on each group of a grouped DataFrame or Series.
It's very flexible; you can compute anything you like for each group.

- agg(): agg() is used to compute one or more summary statistics for each group, like mean, sum, min, max, etc.
It's more structured than apply().

In [4]:
#demonstrating apply()

import pandas as pd

data = pd.DataFrame({
    'country': ['US', 'US', 'FR', 'FR', 'IT'],
    'points': [90, 92, 88, 95, 85]
})

grouped = data.groupby('country')

#now we want the range of points (max minus min) for each country

rangepoints = grouped['points'].apply(lambda x: x.max()-x.min())
print(rangepoints)


#demonstrating agg()

summary = grouped['points'].agg(['min', 'max', 'mean'])
print(summary)

country
FR    7
IT    0
US    2
Name: points, dtype: int64
         min  max  mean
country                
FR        88   95  91.5
IT        85   85  85.0
US        90   92  91.0


## Multi Indices

- A multi-index differs from a regular index in that it has multiple levels. Grouping by more than one column produces one.

In [None]:
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
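Since the reviews file is not included here, the following sketch reproduces the idea on a small, made-up DataFrame. It uses the built-in `"count"` aggregation in place of `len`, and the names `sample_reviews`, `review_counts` and `flat` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the reviews dataset.
sample_reviews = pd.DataFrame({
    "country": ["US", "US", "US", "FR"],
    "province": ["California", "California", "Oregon", "Alsace"],
    "description": ["bold", "crisp", "earthy", "floral"],
})

# Grouping by two columns gives a result indexed by (country, province) pairs.
review_counts = sample_reviews.groupby(["country", "province"])["description"].agg(["count"])

print(review_counts.index.nlevels)  # number of index levels
# A row of a multi-index is addressed with a tuple, one label per level.
print(review_counts.loc[("US", "California"), "count"])

# reset_index() turns the index levels back into ordinary columns.
flat = review_counts.reset_index()
print(flat)
```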

## Sorting

- groupby() returns its results in index order, not in value order

- Therefore, to retrieve the data in value order, we can use the sort_values() method

In [None]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len')

### Order Of Sorting

- sort_values() defaults to an ascending sort, where the lowest values go first. However, most of the time we want a descending sort, where the higher numbers go first.

In [None]:
countries_reviewed.sort_values(by='len', ascending=False)

### Sorting More Than One Column Is Possible


In [None]:
countries_reviewed.sort_values(by=['country', 'len'])

## Creating DataFrames In 3 Different Ways

### (A): Making Our Own DataFrame

## 1. DataFrame (df)

- Think of it as an Excel sheet

In [111]:
dict1 = {
    "name" : ["sidra", "harry", "shubh","skillf"],
    "marks" : [92, 43, 24, 17],
    "city" : ["amsterdam", "london", "paris", "japan"]
}

#dataframe is simply like an excel sheet 

df = pd.DataFrame(dict1)

In [112]:
df

Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london
2,shubh,24,paris
3,skillf,17,japan


### (B): Using CSV Files To Make DataFrames

## Loading A CSV File To Work On Data

- For this, we use pd.read_csv()

In [113]:
import pandas as pd

data = pd.read_csv("company.csv")
print(data)

     ID     Name          Dept         Position    Salary Joining Performance
0  E001     John  Social Media              SMM   $1,500    5-Nov         45%
1  E002    Peter      Robotics       Roboticist   $4,500    3-Jul         23%
2  E003    Sidra  Data Science   Data Scientist  $50,000    2-Feb         95%
3  E004  Natasha       Web Dev    Web Developer   $2,300   23-Mar         63%
4  E005    Danny      Auditing          Auditor   $4,500   18-Aug         30%
5  E006    Lizzy       Finance  Finance Manager   $1,200   17-Apr         12%


### (C): Using Excel Files To Make A DataFrame

## Loading An Excel File To Work On Data

- For this, we use pd.read_excel()

In [114]:
import pandas as pd

data = pd.read_excel("mygrocery.xlsx")
print(data)

   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4      5   butter  6.89 euros       yes
5      2     fish  4.53 euros       yes
6      1      oil  1.15 euros        no


## 2. Writing Data To A CSV File

- For this, we use the .to_csv() method

- It writes the contents of a DataFrame to a CSV file

- Later we can read the file back to manipulate, analyze and use the data as we wish
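A small sketch of writing CSV output. To stay self-contained it calls `to_csv()` without a path, which returns the CSV text as a string instead of creating a file; the `friends` data is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical data to write out.
friends = pd.DataFrame({"name": ["sidra", "harry"], "marks": [92, 43]})

# With no path, to_csv() returns the CSV content as a string.
csv_with_index = friends.to_csv()
csv_without_index = friends.to_csv(index=False)

print(csv_with_index)
print(csv_without_index)

# Reading back text that was saved WITH the index adds an extra,
# unnamed first column, which pandas labels "Unnamed: 0".
reread_with_index = pd.read_csv(io.StringIO(csv_with_index))
reread_without_index = pd.read_csv(io.StringIO(csv_without_index))
print(reread_with_index.columns.tolist())
print(reread_without_index.columns.tolist())
```

Repeatedly saving with the index and reading the file back is one way a file can accumulate several `Unnamed: 0`-style columns; passing `index=False` when saving, or `index_col=0` when reading, avoids that.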

# Exploring Data

- Loading data
- Checking the first rows using .head() and the last rows using .tail() -> by default, each returns 5 rows
- Getting info about data using .info()
- Checking presence of null values using .isnull()

In [115]:
import pandas as pd

data = pd.read_excel("mygrocery.xlsx")
print(data)
print(data.head(3))
print(data.tail(3))
print(data.info()) #this gives info about data like datatypes, non-null values
#print(data.describe()) #this gives us statistical summary of data of numerical columns
print(data.isnull().sum())

   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4      5   butter  6.89 euros       yes
5      2     fish  4.53 euros       yes
6      1      oil  1.15 euros        no
   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
   items    name       price available
4      5  butter  6.89 euros       yes
5      2    fish  4.53 euros       yes
6      1     oil  1.15 euros        no
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   items      7 non-null      int64 
 1   name       7 non-null      object
 2   price      7 non-null      object
 3   available  7 non-null      object
dtypes: int64(1), object(3)
me

## Handling Duplicate Values In Data


In [116]:
import pandas as pd

data = pd.read_excel("mygrocery.xlsx")
print(data)
print(data.duplicated())
print(data["items"].duplicated())
print(data["available"].duplicated())

   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4      5   butter  6.89 euros       yes
5      2     fish  4.53 euros       yes
6      1      oil  1.15 euros        no
0    False
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool
0    False
1    False
2    False
3    False
4    False
5     True
6    False
Name: items, dtype: bool
0    False
1     True
2    False
3     True
4     True
5     True
6     True
Name: available, dtype: bool


In [171]:
data = pd.read_excel("megrocery.xlsx")
print(data)
print(data["items"].duplicated())
print(data["items"].duplicated().sum())
print(data.drop_duplicates())
print(data.drop_duplicates("items"))


   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4      5   butter  6.89 euros       yes
5      2     fish  4.53 euros       yes
6      1      oil  1.15 euros        no
7     10  candies  5.84 euros        no
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
Name: items, dtype: bool
2
   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4      5   butter  6.89 euros       yes
5      2     fish  4.53 euros       yes
6      1      oil  1.15 euros        no
   items     name       price available
0      2    bread  2.30 euros       yes
1      4   apples  3.56 euros       yes
2     10  candies  5.84 euros        no
3      3     milk  3.23 euros       yes
4    

## Working With Missing Values 

In [176]:
import numpy as np
import pandas as pd


data2 = pd.read_csv("missingdata.csv")
print(data2)
print(data2.isnull())
print(data2.isnull().sum())
data2["hobby"] = data2["hobby"].replace(np.nan, "cooking")
print(data2)

    name   age       city        hobby
0  sidra  24.0  amsterdam     studying
1    dan   NaN     london          NaN
2  peter  32.0        NaN     painting
3  lizzi  23.0      paris          NaN
4   john   NaN        NaN          NaN
5    doe   NaN        NaN     swimming
6   alex  27.0     venice  Horseriding
7  sarah   NaN        NaN          NaN
    name    age   city  hobby
0  False  False  False  False
1  False   True  False   True
2  False  False   True  False
3  False  False  False   True
4  False   True   True   True
5  False   True   True  False
6  False  False  False  False
7  False   True   True   True
name     0
age      4
city     4
hobby    4
dtype: int64
    name   age       city        hobby
0  sidra  24.0  amsterdam     studying
1    dan   NaN     london      cooking
2  peter  32.0        NaN     painting
3  lizzi  23.0      paris      cooking
4   john   NaN        NaN      cooking
5    doe   NaN        NaN     swimming
6   alex  27.0     venice  Horseriding
7  sarah  

## 3. Leaving The Index Out Of A Saved File

- For this we pass index=False to .to_csv(), which writes the data without the index (row numbering) column

In [118]:
#we can also remove indices using index = False

df.to_csv("friends.csv", index=False)

## 4. Displaying The First Rows

- For this, we use df.head(n), where n is the number of rows we want to see

In [119]:
#we can also see the first number of rows using df.head()

df.head(2)


Unnamed: 0,name,marks,city
0,sidra,92,amsterdam
1,harry,43,london


## 5. Displaying The Last Rows

- We can display the last rows using df.tail(n), where n is the number of rows we want to see

In [120]:
#we can also see the last number of rows using df.tail()

df.tail(2)

Unnamed: 0,name,marks,city
2,shubh,24,paris
3,skillf,17,japan


## 6. Finding Numerical Stats

- For this, we can use df.describe() to obtain summary statistics of the numerical columns

In [121]:
#we can also check all numerical statistics using df.describe()

df.describe()

Unnamed: 0,marks
count,4.0
mean,44.0
std,33.832923
min,17.0
25%,22.25
50%,33.5
75%,55.25
max,92.0


## 7. Reading A CSV File?


In [122]:
sidra = pd.read_csv("sidra.csv")

In [123]:
sidra

Unnamed: 0.9,Unnamed: 0.8,Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 8. Accessing A Column

- We index the variable that holds the loaded file with the name of the column we want


In [124]:
sidra['speed']
sidra["city"]

0    amsterdam
1       london
2        paris
3        japan
Name: city, dtype: object

## 9. Accessing A Value In A Column

- For this we add the row's index label after the column name, keeping the above format the same

In [125]:
sidra['speed'][2]

np.int64(100)

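## 10. Changing A Value And Saving It To The CSV File

- We use .loc to prevent the chained-assignment (copy) warning; the change reaches the CSV file only once we save the DataFrame with .to_csv()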

In [126]:
sidra.loc[2,'speed'] = 100


In [127]:
sidra

Unnamed: 0.9,Unnamed: 0.8,Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [128]:
sidra.to_csv("sidra.csv", index=False)

In [129]:
sidra

Unnamed: 0.9,Unnamed: 0.8,Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
0,first,first,first,first,first,first,first,first,first,1238480,92,amsterdam
1,second,second,second,second,second,second,second,second,second,3213234,43,london
2,third,third,third,third,third,third,third,third,third,8094380,100,paris
3,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


## 11. Modifying Indices

- We can modify the index as we wish by assigning new labels to .index

In [130]:
sidra.index = ["first", "second", "third", "fourth"]

In [131]:
sidra

Unnamed: 0.9,Unnamed: 0.8,Unnamed: 0.7,Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,train no,speed,city
first,first,first,first,first,first,first,first,first,first,1238480,92,amsterdam
second,second,second,second,second,second,second,second,second,second,3213234,43,london
third,third,third,third,third,third,third,third,third,third,8094380,100,paris
fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,fourth,213214,17,japan


In [132]:
sidra.to_csv("sidra.csv")

## 1. Understanding Series

- It can be a single column or a single row of a DataFrame

In [133]:
ser = pd.Series(np.random.rand(10))

In [134]:
ser

0    0.719354
1    0.609674
2    0.642184
3    0.824603
4    0.164980
5    0.179831
6    0.709756
7    0.162371
8    0.669432
9    0.902258
dtype: float64

In [135]:
type(ser)

pandas.core.series.Series

## 2. Understanding DataFrame:

- It holds multiple Series, one per column

In [136]:
newdf = pd.DataFrame(np.random.rand(334,5), index=np.arange(334))

In [137]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.275578,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


In [138]:
type(newdf)

pandas.core.frame.DataFrame

In [139]:
newdf.dtypes

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [140]:
newdf[0][0] = 0.3

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 0.3


In [141]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.300000,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


In [142]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
      dtype='int64', length=334)

In [143]:
newdf.to_numpy()

array([[0.3       , 0.18312578, 0.19409854, 0.28054028, 0.07848966],
       [0.62702865, 0.35580577, 0.08350055, 0.12656684, 0.06948126],
       [0.5476414 , 0.92448782, 0.35566813, 0.1095513 , 0.63758607],
       ...,
       [0.21469086, 0.67559214, 0.96126601, 0.91616351, 0.10067975],
       [0.05706569, 0.23630931, 0.97663648, 0.70827703, 0.68454863],
       [0.38348719, 0.13073654, 0.53146201, 0.82673684, 0.10065911]])

In [144]:
newdf.sort_index(axis=1, ascending=False)

Unnamed: 0,4,3,2,1,0
0,0.078490,0.280540,0.194099,0.183126,0.300000
1,0.069481,0.126567,0.083501,0.355806,0.627029
2,0.637586,0.109551,0.355668,0.924488,0.547641
3,0.159864,0.774051,0.371198,0.563506,0.100059
4,0.612309,0.436431,0.932906,0.519403,0.207568
...,...,...,...,...,...
329,0.960317,0.919051,0.817526,0.376977,0.836020
330,0.123396,0.802647,0.756857,0.949423,0.607135
331,0.100680,0.916164,0.961266,0.675592,0.214691
332,0.684549,0.708277,0.976636,0.236309,0.057066


In [145]:
type(newdf[0])

pandas.core.series.Series

## View Behaviour Of DataFrames

- Assigning newdf2 = newdf does not copy anything: both names refer to the same DataFrame, so any change made through newdf is visible through newdf2 as well

In [146]:
newdf2 = newdf

In [147]:
newdf[0][0] = 95794

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 95794


In [148]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


## Making An Independent Copy With .copy()

- We can call .copy() on the original DataFrame so that changes made to the new DataFrame are not applied to the original as well

- Here newdf remains the same, but newdf2 does not, since newdf2 was created with newdf.copy()

In [149]:
newdf2 = newdf.copy()

In [150]:
newdf2[0][0] = 59870

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf2[0][0] = 59870


In [151]:
newdf2

Unnamed: 0,0,1,2,3,4
0,59870.000000,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


In [152]:
newdf

Unnamed: 0,0,1,2,3,4
0,95794.000000,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


## Avoiding Copy Warning

- We can use the .loc indexer, as in df.loc[row, column] = value, to avoid the copy warning
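As a minimal sketch, the single-step assignment recommended by the warning text looks like this. The `scores` DataFrame is a made-up example, not the `newdf` used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 DataFrame filled with zeros.
scores = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])

# Row label and column label go into ONE indexing step, so pandas
# updates the original DataFrame and raises no chained-assignment warning.
scores.loc[0, "A"] = 0.3

print(scores)
```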

# Difference Between .loc and .iloc

- iloc is conceptually simpler than loc because it ignores the dataset's indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position.

- loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead.

## Choosing Between .loc and .iloc

- iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. 

- loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

- This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999]
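The difference in slicing can be checked on a small example. The `numbers` DataFrame below is invented for illustration and uses the default 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical DataFrame with the default integer index 0..19.
numbers = pd.DataFrame({"value": range(20)})

by_position = numbers.iloc[0:10]  # positions 0..9, end excluded
by_label = numbers.loc[0:10]      # labels 0..10, end included

print(len(by_position), len(by_label))  # prints "10 11"
```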

### 1. .loc:

- It lets us access rows and columns by their labels (index labels for rows, column names for columns)

In [153]:
newdf.loc[0,0] = 654
newdf.head(2)

Unnamed: 0,0,1,2,3,4
0,654.0,0.183126,0.194099,0.28054,0.07849
1,0.627029,0.355806,0.083501,0.126567,0.069481


In [154]:
newdf.columns = list("ABCDE")
newdf

Unnamed: 0,A,B,C,D,E
0,654.000000,0.183126,0.194099,0.280540,0.078490
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
...,...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051,0.960317
330,0.607135,0.949423,0.756857,0.802647,0.123396
331,0.214691,0.675592,0.961266,0.916164,0.100680
332,0.057066,0.236309,0.976636,0.708277,0.684549


In [155]:
newdf.loc[0,"A"] = 65445
newdf.head()

Unnamed: 0,A,B,C,D,E
0,65445.0,0.183126,0.194099,0.28054,0.07849
1,0.627029,0.355806,0.083501,0.126567,0.069481
2,0.547641,0.924488,0.355668,0.109551,0.637586
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309


In [156]:
newdf.drop("E", axis=1)

Unnamed: 0,A,B,C,D
0,65445.000000,0.183126,0.194099,0.280540
1,0.627029,0.355806,0.083501,0.126567
2,0.547641,0.924488,0.355668,0.109551
3,0.100059,0.563506,0.371198,0.774051
4,0.207568,0.519403,0.932906,0.436431
...,...,...,...,...
329,0.836020,0.376977,0.817526,0.919051
330,0.607135,0.949423,0.756857,0.802647
331,0.214691,0.675592,0.961266,0.916164
332,0.057066,0.236309,0.976636,0.708277


In [157]:
newdf.loc[[1,2], ["C", "D"]]

Unnamed: 0,C,D
1,0.083501,0.126567
2,0.355668,0.109551


## Running Complex Query

- Selecting the rows where column A is smaller than 0.3 and column C is bigger than 0.1

In [158]:
newdf.loc[(newdf["A"]<0.3) & (newdf['C']>0.1 )]

Unnamed: 0,A,B,C,D,E
3,0.100059,0.563506,0.371198,0.774051,0.159864
4,0.207568,0.519403,0.932906,0.436431,0.612309
8,0.129277,0.986903,0.808325,0.245762,0.107715
9,0.046717,0.788777,0.560421,0.782056,0.775858
14,0.172214,0.850192,0.895023,0.210699,0.873522
...,...,...,...,...,...
321,0.015559,0.336059,0.920602,0.343758,0.785206
322,0.063713,0.803495,0.849964,0.128888,0.721158
324,0.154655,0.104284,0.869719,0.000650,0.817159
331,0.214691,0.675592,0.961266,0.916164,0.100680


### 2. .iloc:

- It lets us access rows and columns by their integer positions, regardless of their labels

In [159]:
newdf.head(2)


Unnamed: 0,A,B,C,D,E
0,65445.0,0.183126,0.194099,0.28054,0.07849
1,0.627029,0.355806,0.083501,0.126567,0.069481


In [160]:
#positions start from 0, so the five columns are numbered 0 to 4

newdf.iloc[0,3]

np.float64(0.28054028191466307)

In [161]:
newdf.iloc[[0,5], [1,2]]

Unnamed: 0,B,C
0,0.183126,0.194099
5,0.022351,0.607954


## Using inplace=True To Modify Original Data

- With inplace=True, drop() changes newdf itself instead of returning a new DataFrame

In [162]:
newdf.drop(["A", "D"], axis=1, inplace=True)
newdf

Unnamed: 0,B,C,E
0,0.183126,0.194099,0.078490
1,0.355806,0.083501,0.069481
2,0.924488,0.355668,0.637586
3,0.563506,0.371198,0.159864
4,0.519403,0.932906,0.612309
...,...,...,...
329,0.376977,0.817526,0.960317
330,0.949423,0.756857,0.123396
331,0.675592,0.961266,0.100680
332,0.236309,0.976636,0.684549


## Using .reset_index To Reset The Index

- reset_index() restarts the index from 0; however, it also adds the old index as a new column called index

- To discard the old index instead of keeping it as a column, we can pass drop=True
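Both variants can be compared side by side on a tiny, made-up DataFrame with text labels as its index (the `trains` name and values are illustrative):

```python
import pandas as pd

# Hypothetical DataFrame with custom row labels.
trains = pd.DataFrame({"speed": [92, 43, 100]}, index=["first", "second", "third"])

kept = trains.reset_index()              # old labels move into a column named "index"
dropped = trains.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
```

Both calls return new DataFrames; `trains` keeps its original labels unless the result is assigned back.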

In [163]:
newdf.head(3)
newdf.reset_index(drop=True)

Unnamed: 0,B,C,E
0,0.183126,0.194099,0.078490
1,0.355806,0.083501,0.069481
2,0.924488,0.355668,0.637586
3,0.563506,0.371198,0.159864
4,0.519403,0.932906,0.612309
...,...,...,...
329,0.376977,0.817526,0.960317
330,0.949423,0.756857,0.123396
331,0.675592,0.961266,0.100680
332,0.236309,0.976636,0.684549


## Using df.dropna()

- It removes rows (or, with axis=1, columns) that contain missing values

## 1. Removing Entire Rows With Missing Values

In [164]:

import pandas as pd

df2 = pd.DataFrame({ "name": ["Ali", "Sara", None],
    "age": [20, None, 25]})

print("original:")
print(df2)


print("new:")
print(df2.dropna())

original:
   name   age
0   Ali  20.0
1  Sara   NaN
2  None  25.0
new:
  name   age
0  Ali  20.0


## 2. Drop Rows Only If All Values Are Missing


In [165]:
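#dropna() returns a new DataFrame, so we print its result directly
#no row here has all of its values missing, so nothing is dropped
print(df2.dropna(how="all"))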

   name   age
0   Ali  20.0
1  Sara   NaN
2  None  25.0


## 3. Drop Rows Where Age Is A Missing Value

In [166]:
df2 = df2.dropna(subset=["age"])
print(df2)

   name   age
0   Ali  20.0
2  None  25.0


## Removing Duplicate Values

## 1. Drop Duplicates Normally

In [167]:
import pandas as pd

df = pd.DataFrame({
    "name": ["Ali", "Sara", "Ali", "John"],
    "age": [20, 22, 20, 30]
})

print("original:")
print(df)


print("new:")
print(df.drop_duplicates())

original:
   name  age
0   Ali   20
1  Sara   22
2   Ali   20
3  John   30
new:
   name  age
0   Ali   20
1  Sara   22
3  John   30


## 2. Drop Duplicates Column Wise

In [168]:
newdf = df.drop_duplicates(subset=["name"])
print(newdf)

   name  age
0   Ali   20
1  Sara   22
3  John   30


## 3. Drop Duplicates While Keeping The Last Occurrence

In [169]:
newdf1 = df.drop_duplicates(keep="last")
print(newdf1)

   name  age
1  Sara   22
2   Ali   20
3  John   30


In [170]:
df2 = df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': [np.nan, np.nan, np.nan, np.nan, np.nan],
    'rating': [pd.NaT, 4, 3.5, 15, 5]
})

df.head()
df.dropna()
df.drop_duplicates(subset=["brand"])
df.info()
df['rating'].value_counts(dropna=True)
df.notnull()
df.isnull()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   brand   5 non-null      object 
 1   style   0 non-null      float64
 2   rating  4 non-null      object 
dtypes: float64(1), object(2)
memory usage: 252.0+ bytes


Unnamed: 0,brand,style,rating
0,False,True,True
1,False,True,False
2,False,True,False
3,False,True,False
4,False,True,False
