Pandas is an open-source data analysis library for Python. It builds on the speed of NumPy arrays to make data analysis and preprocessing easier, and it provides a rich set of operations for labeled data.

Pandas has 2 main data structures:
1. DataFrame:
- Tabular, spreadsheet-like structure made of rows and columns. Each column has a single datatype.
- 2D labeled structure whose columns can each hold a different type of data.

2. Series:
- 1D array with an index; a single column or row of a DataFrame is a Series. It has one uniform datatype.
- 1D labeled array capable of holding any type of data (mixed values are stored with the generic object dtype).
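
As a small illustration of the two structures described above, the sketch below builds a DataFrame whose columns have different dtypes and then pulls one column out as a Series. The names `people` and `marks` and the sample values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
people = pd.DataFrame({
    "name": ["Alice", "Bob"],     # text column
    "marks": [28, 34],            # integer column
    "passed": [True, False],      # boolean column
})

# Each column keeps its own dtype; the DataFrame as a whole is heterogeneous.
print(people.dtypes)

# Selecting one column returns a Series that shares the DataFrame's index.
marks = people["marks"]
print(type(marks).__name__)                 # Series
print(marks.index.equals(people.index))     # True
```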

In [2]:
import numpy as np
import pandas as pd

In [3]:
dict1 = {
    "name": ["Alice", "Bob", "Charlie"],
    "marks": [28, 34, 24],
    "city": ["NYC", "Las Vegas", "Chicago"]
}

In [4]:
df = pd.DataFrame(dict1)
# Create a DataFrame from the dictionary, like an Excel table
df

Unnamed: 0,name,marks,city
0,Alice,28,NYC
1,Bob,34,Las Vegas
2,Charlie,24,Chicago


In [5]:
df.to_csv('friends.csv')  # Save the DataFrame to a CSV file

In [6]:
df.to_csv('friends_without_index.csv', index=False) 
# Save the DataFrame to a CSV file without the index

In [7]:
df.head(2)  # Display the first two rows of the DataFrame

Unnamed: 0,name,marks,city
0,Alice,28,NYC
1,Bob,34,Las Vegas


In [8]:
df.tail(2)  # Display the last two rows of the DataFrame

Unnamed: 0,name,marks,city
1,Bob,34,Las Vegas
2,Charlie,24,Chicago


In [9]:
df.describe()  # Display a statistical summary of the DataFrame
# Only the numerical columns are included by default

Unnamed: 0,marks
count,3.0
mean,28.666667
std,5.033223
min,24.0
25%,26.0
50%,28.0
75%,31.0
max,34.0
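
The `std` row can be checked by hand. The sketch below recomputes it for the same three marks and shows that `describe()` reports the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

marks = pd.Series([28, 34, 24], name="marks")

summary = marks.describe()

# pandas reports the sample standard deviation (divides by n - 1) ...
sample_std = np.std(marks.to_numpy(), ddof=1)
# ... while NumPy's default is the population standard deviation (divides by n).
population_std = np.std(marks.to_numpy())

print(round(float(summary["std"]), 6))
print(round(float(sample_std), 6))
print(round(float(population_std), 6))
```

The first two printed values agree with each other and with the `std` value in the table above; the third is smaller.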


In [10]:
file = pd.read_csv('file.csv')
# Read a CSV file into a DataFrame 

In [11]:
file

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Train No.,Speed,City
0,0,0,0,12332,50,NYC
1,1,1,1,12453,84,Las Vegas
2,2,2,2,12432,76,Chicago


In [12]:
file['Speed']  # Access the 'Speed' column of the DataFrame

0    50
1    84
2    76
Name: Speed, dtype: int64

In [13]:
file['Speed'][0]  # Access the first element of the 'Speed' column

50

In [14]:
file['Speed'][0:5]  # Access up to the first five rows of the 'Speed' column (only three exist)

0    50
1    84
2    76
Name: Speed, dtype: int64

In [15]:
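file['Speed'][0] = 50
# Chained indexing like this triggers the warning below;
# file.loc[0, 'Speed'] = 50 performs the same assignment in a single step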

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  file['Speed'][0] = 50


In [16]:
file

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Train No.,Speed,City
0,0,0,0,12332,50,NYC
1,1,1,1,12453,84,Las Vegas
2,2,2,2,12432,76,Chicago


In [17]:
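file.to_csv('file.csv', index=False)  # update the file with the new value
# index=False keeps the row labels out of the file; saving with the default
# index=True adds another "Unnamed: 0" column on every save and reload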
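
The run of `Unnamed: 0` columns in the output above is what happens when a DataFrame is written with its index and read back without saying which column holds it. The sketch below reproduces the effect in memory with `io.StringIO` (so no files are touched) and shows two ways to avoid it. The `trains` data is a made-up stand-in for the file's contents.

```python
import io
import pandas as pd

trains = pd.DataFrame({"Train No.": [12332, 12453], "Speed": [50, 84]})

# The default index=True stores the row labels as an unnamed first column.
with_index = io.StringIO()
trains.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(list(reloaded.columns))   # ['Unnamed: 0', 'Train No.', 'Speed']

# Option 1: do not write the index at all.
no_index = io.StringIO()
trains.to_csv(no_index, index=False)
no_index.seek(0)
clean = pd.read_csv(no_index)
print(list(clean.columns))      # ['Train No.', 'Speed']

# Option 2: keep the index in the file and tell read_csv which column holds it.
with_index.seek(0)
restored = pd.read_csv(with_index, index_col=0)
print(list(restored.columns))   # ['Train No.', 'Speed']
```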

In [18]:
file.index = ['first', 'second', 'third']

In [19]:
file

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Train No.,Speed,City
first,0,0,0,12332,50,NYC
second,1,1,1,12453,84,Las Vegas
third,2,2,2,12432,76,Chicago


Exploring Series

In [20]:
ser = pd.Series(np.random.rand)
# np.random.rand is passed without being called, so the Series
# holds the function object itself rather than random numbers

In [21]:
ser # Displaying the Series

0    <built-in method rand of numpy.random.mtrand.R...
dtype: object

In [22]:
# Create a Series with 34 random numbers
ser = pd.Series(np.random.rand(34))
ser

0     0.551471
1     0.611251
2     0.585310
3     0.009957
4     0.471961
5     0.884017
6     0.437245
7     0.321148
8     0.336249
9     0.318365
10    0.121300
11    0.881071
12    0.039457
13    0.689906
14    0.255672
15    0.387883
16    0.300262
17    0.882043
18    0.432883
19    0.196304
20    0.724168
21    0.830916
22    0.392806
23    0.830922
24    0.646969
25    0.368922
26    0.973301
27    0.320075
28    0.073442
29    0.570977
30    0.757685
31    0.943224
32    0.072896
33    0.116659
dtype: float64

In [23]:
type(ser)  # Display the type of the Series

pandas.core.series.Series

Exploring DataFrame

In [24]:
newdf = pd.DataFrame(np.random.rand(334,5), index=np.arange(334))
# Creating a DataFrame with 334 rows and 5 columns of random numbers

In [25]:
newdf

Unnamed: 0,0,1,2,3,4
0,0.728944,0.743868,0.629590,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.471760
4,0.418825,0.923470,0.211730,0.754540,0.568240
...,...,...,...,...,...
329,0.536941,0.303951,0.910985,0.446804,0.305244
330,0.618177,0.135213,0.098617,0.307283,0.192841
331,0.697051,0.629480,0.986865,0.488802,0.195982
332,0.144277,0.406587,0.864793,0.630887,0.375024


In [26]:
newdf.head(6)

Unnamed: 0,0,1,2,3,4
0,0.728944,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.47176
4,0.418825,0.92347,0.21173,0.75454,0.56824
5,0.669157,0.303465,0.884063,0.10941,0.60482


In [27]:
type(newdf)  # Display the type of the DataFrame

pandas.core.frame.DataFrame

In [28]:
newdf.describe()  # Display a statistical summary of the DataFrame

Unnamed: 0,0,1,2,3,4
count,334.0,334.0,334.0,334.0,334.0
mean,0.513554,0.535911,0.510367,0.470264,0.529533
std,0.286755,0.29466,0.293269,0.27636,0.278897
min,0.00295,0.002209,0.00113,0.00058,0.000634
25%,0.271971,0.281968,0.254766,0.226901,0.293006
50%,0.524087,0.561769,0.520845,0.447288,0.522479
75%,0.746323,0.803017,0.774716,0.680254,0.761063
max,0.993485,0.998447,0.998541,0.995006,0.999965


In [29]:
newdf.dtypes  # Display the data types of each column in the DataFrame

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [30]:
newdf.index

Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
       ...
       324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
      dtype='int32', length=334)

In [31]:
newdf.columns # Display the column names of the DataFrame

RangeIndex(start=0, stop=5, step=1)

In [32]:
newdf.to_numpy()  # Convert the DataFrame to a NumPy array

array([[0.72894368, 0.7438677 , 0.62959031, 0.89957887, 0.49287303],
       [0.14430072, 0.19261095, 0.55274171, 0.77481382, 0.4630627 ],
       [0.42215352, 0.38244866, 0.78844873, 0.26925176, 0.00166202],
       ...,
       [0.69705073, 0.62948   , 0.98686474, 0.48880217, 0.19598194],
       [0.14427723, 0.40658699, 0.86479346, 0.63088652, 0.37502395],
       [0.86916341, 0.56956142, 0.25538129, 0.51227057, 0.65091227]])

In [33]:
newdf[0][0] = 0.3

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf[0][0] = 0.3


In [34]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.47176
4,0.418825,0.92347,0.21173,0.75454,0.56824


In [35]:
newdf.T # Transpose the DataFrame

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,324,325,326,327,328,329,330,331,332,333
0,0.3,0.144301,0.422154,0.965316,0.418825,0.669157,0.859307,0.656442,0.502271,0.961511,...,0.373131,0.19744,0.82855,0.703502,0.683283,0.536941,0.618177,0.697051,0.144277,0.869163
1,0.743868,0.192611,0.382449,0.454503,0.92347,0.303465,0.061184,0.494598,0.870533,0.078582,...,0.251322,0.07425,0.406425,0.448883,0.189856,0.303951,0.135213,0.62948,0.406587,0.569561
2,0.62959,0.552742,0.788449,0.779935,0.21173,0.884063,0.562904,0.050291,0.707998,0.234567,...,0.163093,0.339265,0.388448,0.599052,0.308377,0.910985,0.098617,0.986865,0.864793,0.255381
3,0.899579,0.774814,0.269252,0.719568,0.75454,0.10941,0.984806,0.128328,0.323279,0.173365,...,0.94753,0.155212,0.220778,0.728384,0.023889,0.446804,0.307283,0.488802,0.630887,0.512271
4,0.492873,0.463063,0.001662,0.47176,0.56824,0.60482,0.072186,0.273089,0.404738,0.264574,...,0.833087,0.788935,0.159345,0.800599,0.431205,0.305244,0.192841,0.195982,0.375024,0.650912


In [36]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.47176
4,0.418825,0.92347,0.21173,0.75454,0.56824


In [37]:
newdf.sort_index(axis=0, ascending=False)
# Sort the DataFrame by index in descending order
#axis=0 - rows
#axis=1 - columns

Unnamed: 0,0,1,2,3,4
333,0.869163,0.569561,0.255381,0.512271,0.650912
332,0.144277,0.406587,0.864793,0.630887,0.375024
331,0.697051,0.629480,0.986865,0.488802,0.195982
330,0.618177,0.135213,0.098617,0.307283,0.192841
329,0.536941,0.303951,0.910985,0.446804,0.305244
...,...,...,...,...,...
4,0.418825,0.923470,0.211730,0.754540,0.568240
3,0.965316,0.454503,0.779935,0.719568,0.471760
2,0.422154,0.382449,0.788449,0.269252,0.001662
1,0.144301,0.192611,0.552742,0.774814,0.463063
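
To make the `axis` argument concrete, here is a minimal sketch on a tiny frame whose row and column labels are both out of order. The labels and the name `grid` are invented for the example.

```python
import pandas as pd

grid = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["row_b", "row_a"],
    columns=["z", "x", "y"],
)

by_rows = grid.sort_index(axis=0)   # reorder rows by their labels
by_cols = grid.sort_index(axis=1)   # reorder columns by their labels

print(list(by_rows.index))     # ['row_a', 'row_b']
print(list(by_cols.columns))   # ['x', 'y', 'z']

# sort_index returns a new object; grid itself keeps its original order.
print(list(grid.index))        # ['row_b', 'row_a']
```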


In [38]:
newdf[0]  #gives me a series

0      0.300000
1      0.144301
2      0.422154
3      0.965316
4      0.418825
         ...   
329    0.536941
330    0.618177
331    0.697051
332    0.144277
333    0.869163
Name: 0, Length: 334, dtype: float64

In [39]:
type(newdf[0])  # Display the type of the Series

pandas.core.series.Series

In [40]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.3,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.47176
4,0.418825,0.92347,0.21173,0.75454,0.56824


In [41]:
newdf2 = newdf

In [42]:
newdf2[0][0] = 978

In [43]:
newdf
#[0][0] = 978, because newdf2 = newdf only binds a second name to the same DataFrame object
# If you want an independent copy of the DataFrame, use the copy() method

Unnamed: 0,0,1,2,3,4
0,978.000000,0.743868,0.629590,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.471760
4,0.418825,0.923470,0.211730,0.754540,0.568240
...,...,...,...,...,...
329,0.536941,0.303951,0.910985,0.446804,0.305244
330,0.618177,0.135213,0.098617,0.307283,0.192841
331,0.697051,0.629480,0.986865,0.488802,0.195982
332,0.144277,0.406587,0.864793,0.630887,0.375024


In [44]:
newdf2 = newdf.copy()
# Create an independent copy of the DataFrame
# (newdf2 = newdf[:] is not a reliable substitute: a slice can share data with the original)
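
The difference between a second name and a real copy can be checked directly with `is`. This is a small sketch with made-up data; it uses single-step `.loc` assignment so that it does not depend on chained-assignment behaviour.

```python
import pandas as pd

original = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

alias = original                # second name for the same object
independent = original.copy()   # deep copy by default

print(alias is original)         # True
print(independent is original)   # False

# A change made through the alias is visible through the original name.
alias.loc[0, "A"] = 978.0
# A change made to the copy is not.
independent.loc[1, "B"] = -1.0

print(original.loc[0, "A"])   # 978.0
print(original.loc[1, "B"])   # 4.0
```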

In [45]:
newdf2[0][0]=9783

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf2[0][0]=9783


In [46]:
newdf  #no changes as newdf2 is a copy of newdf

Unnamed: 0,0,1,2,3,4
0,978.000000,0.743868,0.629590,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.471760
4,0.418825,0.923470,0.211730,0.754540,0.568240
...,...,...,...,...,...
329,0.536941,0.303951,0.910985,0.446804,0.305244
330,0.618177,0.135213,0.098617,0.307283,0.192841
331,0.697051,0.629480,0.986865,0.488802,0.195982
332,0.144277,0.406587,0.864793,0.630887,0.375024


With chained indexing followed by an assignment, pandas may operate on either a view or a copy of the data, so it is unreliable whether the original DataFrame gets updated.
Therefore we set values with the .loc indexer in a single step.
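
A minimal sketch of the single-step form, on a small made-up frame named `scores`: the row selector and the column label go inside the same pair of brackets, so pandas performs one indexing operation on the original object.

```python
import pandas as pd

scores = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.0, 2.0, 3.0]})

# One indexing call with both the row label and the column label.
scores.loc[0, "A"] = 654.0

# Several cells at once, selected by a boolean mask on the rows.
scores.loc[scores["B"] > 1.5, "A"] = 0.0

print(scores["A"].tolist())   # [654.0, 0.0, 0.0]
```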

In [47]:
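newdf.loc[0][0] = 654
# Still chained indexing: .loc[0] returns a row first and [0] then indexes into it,
# hence the warning below. The single-step form is newdf.loc[0, 0] = 654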

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  newdf.loc[0][0] = 654


In [48]:
newdf.head(3)

Unnamed: 0,0,1,2,3,4
0,654.0,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662


In [49]:
newdf.columns = list("ABCDE")
# Rename the columns of the DataFrame to A, B, C, D, E

In [50]:
newdf.head(2)

Unnamed: 0,A,B,C,D,E
0,654.0,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063


In [53]:
newdf.loc[0,'A'] = 657
newdf.head(3)

Unnamed: 0,A,B,C,D,E
0,657.0,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662


In [54]:
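newdf.loc[0,0] = 640
# There is no column labelled 0 any more (the columns are now A to E),
# so .loc creates a new column 0 and fills the other rows with NaN
newdf.head(2)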

Unnamed: 0,A,B,C,D,E,0
0,657.0,0.743868,0.62959,0.899579,0.492873,640.0
1,0.144301,0.192611,0.552742,0.774814,0.463063,


In [64]:
newdf = newdf.drop(0, axis=1)
#drop the extra column 0

In [65]:
newdf.loc[[1,2], ['C', 'D']]
# Access specific rows and columns using loc

Unnamed: 0,C,D
1,0.552742,0.774814
2,0.788449,0.269252


In [66]:
newdf.loc[:, ['C', 'D']]
# Access all rows for columns C and D

Unnamed: 0,C,D
0,0.629590,0.899579
1,0.552742,0.774814
2,0.788449,0.269252
3,0.779935,0.719568
4,0.211730,0.754540
...,...,...
329,0.910985,0.446804
330,0.098617,0.307283
331,0.986865,0.488802
332,0.864793,0.630887


In [None]:
newdf.loc[[1,2], :]
# Access specific rows for all columns

Unnamed: 0,A,B,C,D,E
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662


In [69]:
newdf.loc[(newdf['A'] < 0.3)]
# Access rows where the value in column 'A' is less than 0.3

Unnamed: 0,A,B,C,D,E
1,0.144301,0.192611,0.552742,0.774814,0.463063
13,0.088655,0.046739,0.592366,0.108689,0.855389
16,0.033015,0.125724,0.201869,0.944233,0.908967
25,0.125060,0.227928,0.931798,0.420404,0.766975
28,0.009249,0.342140,0.635581,0.110425,0.446664
...,...,...,...,...,...
319,0.173156,0.777196,0.167584,0.728418,0.021629
321,0.090172,0.844035,0.742834,0.449366,0.512191
322,0.032174,0.501088,0.872086,0.163660,0.910567
325,0.197440,0.074250,0.339265,0.155212,0.788935


In [70]:
newdf.loc[(newdf['A'] < 0.3) & (newdf['C'] > 0.1)]
# Access rows where column 'A' is less than 0.3 and column 'C' is greater than 0.1

Unnamed: 0,A,B,C,D,E
1,0.144301,0.192611,0.552742,0.774814,0.463063
13,0.088655,0.046739,0.592366,0.108689,0.855389
16,0.033015,0.125724,0.201869,0.944233,0.908967
25,0.125060,0.227928,0.931798,0.420404,0.766975
28,0.009249,0.342140,0.635581,0.110425,0.446664
...,...,...,...,...,...
319,0.173156,0.777196,0.167584,0.728418,0.021629
321,0.090172,0.844035,0.742834,0.449366,0.512191
322,0.032174,0.501088,0.872086,0.163660,0.910567
325,0.197440,0.074250,0.339265,0.155212,0.788935
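
The conditions passed to `.loc` above are ordinary boolean Series, so they can be stored and combined. The sketch below uses a four-row made-up frame so the result of each combination is easy to verify by eye. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.5, 0.2, 0.9], "C": [0.05, 0.7, 0.4, 0.3]})

low_a = frame["A"] < 0.3    # boolean Series, one entry per row
high_c = frame["C"] > 0.1

both = frame.loc[low_a & high_c]     # rows where both conditions hold
either = frame.loc[low_a | high_c]   # rows where at least one holds
not_low = frame.loc[~low_a]          # rows where the first condition fails

print(list(both.index))      # [2]
print(list(either.index))    # [0, 1, 2, 3]
print(list(not_low.index))   # [1, 3]
```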


In [72]:
newdf.iloc[0,4]
# Access the first row and fifth column using iloc
#iloc is used for integer-location based indexing
#loc is used for label-based indexing 

0.4928730285245344
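
With the default 0, 1, 2, ... index, `loc` and `iloc` often return the same thing, which hides the difference. The sketch below uses a made-up frame whose integer labels are deliberately out of order, so label-based and position-based lookups diverge.

```python
import pandas as pd

letters = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1, 2, 3]},
    index=[2, 0, 1],   # integer labels deliberately out of order
)

print(letters.loc[0, "A"])   # 20, the row whose label is 0
print(letters.iloc[0, 0])    # 10, the first row by position

# Slices differ too: loc includes the end label, iloc excludes the end position.
print(len(letters.loc[2:0]))    # 2 rows (labels 2 and 0)
print(len(letters.iloc[0:1]))   # 1 row (position 0 only)
```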

In [75]:
newdf.iloc[[0,5], [1,2]]

Unnamed: 0,B,C
0,0.743868,0.62959
5,0.303465,0.884063


In [76]:
newdf.head(3)

Unnamed: 0,A,B,C,D,E
0,657.0,0.743868,0.62959,0.899579,0.492873
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662


In [None]:
newdf.drop([0])
# Drop the row labelled 0 (returns a new DataFrame; newdf itself is unchanged)

Unnamed: 0,A,B,C,D,E
1,0.144301,0.192611,0.552742,0.774814,0.463063
2,0.422154,0.382449,0.788449,0.269252,0.001662
3,0.965316,0.454503,0.779935,0.719568,0.471760
4,0.418825,0.923470,0.211730,0.754540,0.568240
5,0.669157,0.303465,0.884063,0.109410,0.604820
...,...,...,...,...,...
329,0.536941,0.303951,0.910985,0.446804,0.305244
330,0.618177,0.135213,0.098617,0.307283,0.192841
331,0.697051,0.629480,0.986865,0.488802,0.195982
332,0.144277,0.406587,0.864793,0.630887,0.375024


In [None]:
newdf.drop(['A'], axis=1)
# Drop column 'A' (returns a new DataFrame; newdf itself is unchanged)

Unnamed: 0,B,C,D,E
0,0.743868,0.629590,0.899579,0.492873
1,0.192611,0.552742,0.774814,0.463063
2,0.382449,0.788449,0.269252,0.001662
3,0.454503,0.779935,0.719568,0.471760
4,0.923470,0.211730,0.754540,0.568240
...,...,...,...,...
329,0.303951,0.910985,0.446804,0.305244
330,0.135213,0.098617,0.307283,0.192841
331,0.629480,0.986865,0.488802,0.195982
332,0.406587,0.864793,0.630887,0.375024


In [83]:
newdf.drop(['A', 'D'], axis=1)
# Drop the first and fourth columns of the DataFrame

Unnamed: 0,B,C,E
0,0.743868,0.629590,0.492873
1,0.192611,0.552742,0.463063
2,0.382449,0.788449,0.001662
3,0.454503,0.779935,0.471760
4,0.923470,0.211730,0.568240
...,...,...,...
329,0.303951,0.910985,0.305244
330,0.135213,0.098617,0.192841
331,0.629480,0.986865,0.195982
332,0.406587,0.864793,0.375024


In [86]:
newdf.drop([1,5], axis=0, inplace=True)
# inplace=True modifies the original DataFrame directly
# and returns None instead of a new DataFrame

In [87]:
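newdf
# Columns A and D are also absent here, so they appear to have been dropped in place in a cell that is not shown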

Unnamed: 0,B,C,E
0,0.743868,0.629590,0.492873
2,0.382449,0.788449,0.001662
3,0.454503,0.779935,0.471760
4,0.923470,0.211730,0.568240
6,0.061184,0.562904,0.072186
...,...,...,...
329,0.303951,0.910985,0.305244
330,0.135213,0.098617,0.192841
331,0.629480,0.986865,0.195982
332,0.406587,0.864793,0.375024


In [91]:
newdf.reset_index()
# Reset the index of the DataFrame
#but adds the old index as a new column

Unnamed: 0,index,B,C,E
0,0,0.743868,0.629590,0.492873
1,2,0.382449,0.788449,0.001662
2,3,0.454503,0.779935,0.471760
3,4,0.923470,0.211730,0.568240
4,6,0.061184,0.562904,0.072186
...,...,...,...,...
327,329,0.303951,0.910985,0.305244
328,330,0.135213,0.098617,0.192841
329,331,0.629480,0.986865,0.195982
330,332,0.406587,0.864793,0.375024


In [92]:
newdf.reset_index(drop=True, inplace=True)
# Reset the index of the DataFrame and 
# drop the old index column
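
The interaction between `drop` and `reset_index` can be seen on a four-row made-up frame: dropping a row leaves a gap in the labels, and `reset_index` closes it, either keeping the old labels as a column or discarding them.

```python
import pandas as pd

frame = pd.DataFrame({"B": [0.7, 0.2, 0.4, 0.9]})

# drop returns a new DataFrame by default; frame is left untouched.
trimmed = frame.drop([1], axis=0)
print(list(frame.index))     # [0, 1, 2, 3]
print(list(trimmed.index))   # [0, 2, 3]

# reset_index() keeps the old labels in a column named "index".
kept = trimmed.reset_index()
print(list(kept.columns))    # ['index', 'B']

# reset_index(drop=True) discards them and renumbers from 0.
renumbered = trimmed.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1, 2]
```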

In [93]:
newdf.head(4)

Unnamed: 0,B,C,E
0,0.743868,0.62959,0.492873
1,0.382449,0.788449,0.001662
2,0.454503,0.779935,0.47176
3,0.92347,0.21173,0.56824


In [95]:
newdf['B'].isnull()
# Check for null values in column 'B'

0      False
1      False
2      False
3      False
4      False
       ...  
327    False
328    False
329    False
330    False
331    False
Name: B, Length: 332, dtype: bool

In [None]:
newdf['B'] = None
# Set all values in column 'B' to None (null)

In [97]:
newdf.head(2)

Unnamed: 0,B,C,E
0,,0.62959,0.492873
1,,0.788449,0.001662


In [98]:
newdf['B'].isnull()
# Check for null values in column 'B'

0      True
1      True
2      True
3      True
4      True
       ... 
327    True
328    True
329    True
330    True
331    True
Name: B, Length: 332, dtype: bool

In [102]:
# Assigning through .loc is the preferred way to set a whole column
newdf.loc[:, ['B']] = 34

In [103]:
newdf

Unnamed: 0,B,C,E
0,34,0.629590,0.492873
1,34,0.788449,0.001662
2,34,0.779935,0.471760
3,34,0.211730,0.568240
4,34,0.562904,0.072186
...,...,...,...
327,34,0.910985,0.305244
328,34,0.098617,0.192841
329,34,0.986865,0.195982
330,34,0.864793,0.375024


In [110]:
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Alfred'],
                   "toy": [np.nan, 'Batmobile', 'Bullwhip'],
                   "born": [pd.NaT, pd.Timestamp('1940-04-25'), pd.NaT]})

#create a DataFrame with NaN and NaT values

In [106]:
df.head()

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [107]:
df.dropna()
# Drop rows with any NaN values

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [109]:
df.dropna(how='all', axis=1)
# Drop a column only if all of its values are missing (no column qualifies here, so nothing is dropped)

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT
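
Besides `how` and `axis`, `dropna` accepts `subset` and `thresh` to control which missing values matter. The sketch below rebuilds the same three-row data under the hypothetical name `heroes` and compares the options.

```python
import numpy as np
import pandas as pd

heroes = pd.DataFrame({
    "name": ["Alfred", "Batman", "Alfred"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT],
})

any_missing = heroes.dropna()               # drop rows with at least one missing value
toy_known = heroes.dropna(subset=["toy"])   # only look at the 'toy' column
two_present = heroes.dropna(thresh=2)       # keep rows with at least 2 non-missing values

print(list(any_missing.index))   # [1]
print(list(toy_known.index))     # [1, 2]
print(list(two_present.index))   # [1, 2]
```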


In [None]:
df.drop_duplicates
#no effect, just returns the method

<bound method DataFrame.drop_duplicates of      name        toy       born
0  Alfred        NaN        NaT
1  Batman  Batmobile 1940-04-25
2  Alfred   Bullwhip        NaT>

In [113]:
df.drop_duplicates(subset = ['name'])
#uses subset to specify which columns to consider 
# for identifying duplicates

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25


In [115]:
df.drop_duplicates(subset = ['name'], keep='last')
# Keep the last occurrence of each duplicate row

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Alfred,Bullwhip,NaT


In [118]:
df.drop_duplicates(subset = ['name'], keep=False)
# Drop all duplicate rows, keeping none

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
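
The three `keep` settings are easier to compare side by side. The sketch below uses the same data under the hypothetical name `heroes` and also shows `duplicated`, which returns the boolean mask that `drop_duplicates` acts on.

```python
import numpy as np
import pandas as pd

heroes = pd.DataFrame({
    "name": ["Alfred", "Batman", "Alfred"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
})

# True marks a row whose 'name' has already appeared earlier.
repeat_mask = heroes.duplicated(subset=["name"])
print(repeat_mask.tolist())   # [False, False, True]

first = heroes.drop_duplicates(subset=["name"])                # keep='first' is the default
last = heroes.drop_duplicates(subset=["name"], keep="last")
unique_only = heroes.drop_duplicates(subset=["name"], keep=False)

print(list(first.index))         # [0, 1]
print(list(last.index))          # [1, 2]
print(list(unique_only.index))   # [1]
```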


In [119]:
df.shape

(3, 3)

In [131]:
df.describe()

Unnamed: 0,born
count,1
mean,1940-04-25 00:00:00
min,1940-04-25 00:00:00
25%,1940-04-25 00:00:00
50%,1940-04-25 00:00:00
75%,1940-04-25 00:00:00
max,1940-04-25 00:00:00


In [None]:
df.info
# Without parentheses this only returns the bound method (shown below).
# Call df.info() to print the number of entries, the columns, and their data types

<bound method DataFrame.info of      name        toy       born
0  Alfred        NaN        NaT
1  Batman  Batmobile 1940-04-25
2  Alfred   Bullwhip        NaT>
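
For comparison, this sketch actually calls `info()`. It passes the `buf` parameter so the report is captured as a string instead of being printed directly; the frame and variable names are made up for the example.

```python
import io
import pandas as pd

heroes = pd.DataFrame({"name": ["Alfred", "Batman"], "toy": [None, "Batmobile"]})

method = heroes.info      # no parentheses: only a reference to the method
print(callable(method))   # True

# Calling it produces the summary; buf= redirects the text so it can be captured.
buffer = io.StringIO()
result = heroes.info(buf=buffer)
report = buffer.getvalue()

print(result)   # None, because info() writes its report rather than returning it
print(report)
```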

In [127]:
df['toy'].value_counts(dropna=False)
# Count the occurrences of each unique value
# in the 'toy' column,
# including NaN values in the count

toy
NaN          1
Batmobile    1
Bullwhip     1
Name: count, dtype: int64

In [128]:
df['toy'].value_counts(dropna=True)
#does not include NaN values in the count

toy
Batmobile    1
Bullwhip     1
Name: count, dtype: int64

In [129]:
df.notnull()
# Check for non-null values in the DataFrame

Unnamed: 0,name,toy,born
0,True,False,False
1,True,True,True
2,True,True,False


In [130]:
df.isnull()
# Check for null values in the DataFrame

Unnamed: 0,name,toy,born
0,False,True,True
1,False,False,False
2,False,False,True


In [132]:
#Creating a dataframe with only integer values
#with 3 rows and 2 columns

data = {
    'A': [10, 20, 30],
    'B': [20, 40, 60]
}

In [133]:
df2 = pd.DataFrame(data)
# Create the DataFrame from the dictionary

In [134]:
df2.head()

Unnamed: 0,A,B
0,10,20
1,20,40
2,30,60


In [135]:
df2.describe()
# Display a statistical summary of the DataFrame

Unnamed: 0,A,B
count,3.0,3.0
mean,20.0,40.0
std,10.0,20.0
min,10.0,20.0
25%,15.0,30.0
50%,20.0,40.0
75%,25.0,50.0
max,30.0,60.0


In [137]:
df2.mean()
# Calculate the mean of each column in the DataFrame

A    20.0
B    40.0
dtype: float64

In [139]:
df2.corr()
# Calculate the correlation between columns in the DataFrame

Unnamed: 0,A,B
A,1.0,1.0
B,1.0,1.0
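
Every entry is 1.0 because column B is exactly twice column A. As a check, the sketch below computes the Pearson coefficient (the default method of `corr`) by hand for the same data and compares it with the pandas result.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [10, 20, 30], "B": [20, 40, 60]})

a = df2["A"].to_numpy(dtype=float)
b = df2["B"].to_numpy(dtype=float)

# Pearson correlation: sample covariance divided by the product of the
# sample standard deviations.
covariance = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
pearson = covariance / (a.std(ddof=1) * b.std(ddof=1))

print(pearson)
print(df2["A"].corr(df2["B"]))
```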


In [141]:
df2.count()
# Count the number of non-null values in each column of the DataFrame

A    3
B    3
dtype: int64

In [142]:
df2.max()

A    30
B    60
dtype: int64

In [143]:
df2.min()

A    10
B    20
dtype: int64

In [144]:
df2.median()

A    20.0
B    40.0
dtype: float64

In [146]:
df2.std()

A    10.0
B    20.0
dtype: float64

In [148]:
#reading data from excel file
newdata = pd.read_excel('excel_data.xlsx') 

In [None]:
newdata #reads sheet 1 by default

Unnamed: 0,Train No.,Speed,City
0,12332,50,NYC
1,12453,84,Las Vegas
2,12432,76,Chicago


In [152]:
newdata = pd.read_excel('excel_data.xlsx', sheet_name='Sheet2') 
#reads a specific sheet by name
newdata

Unnamed: 0,Train No.S2,SpeedS2,CityS2
0,12327,52,LA
1,12346,64,Kansas
2,12358,70,Texas


In [154]:
newdata.iloc[0,0] = 12335
newdata

Unnamed: 0,Train No.S2,SpeedS2,CityS2
0,12335,52,LA
1,12346,64,Kansas
2,12358,70,Texas


In [None]:
newdata.to_excel('excel_data.xlsx', sheet_name='Sheet2')
# Save the DataFrame to an Excel file.
# Writing to a file path replaces the whole workbook, so the file
# now contains only Sheet2 and Sheet1 is lost
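
One way to update a single sheet without losing the others is to open the workbook in append mode with `pd.ExcelWriter`. The sketch below assumes the openpyxl engine is installed and that the file already exists. The helper name `replace_sheet` is made up for this example, and writing with `index=False` is a choice made here, not something the original cell does.

```python
import pandas as pd


def replace_sheet(path, frame, sheet_name):
    """Rewrite one sheet of an existing .xlsx file and keep the other sheets.

    Assumes the openpyxl engine is available and that `path` already exists.
    """
    with pd.ExcelWriter(
        path,
        engine="openpyxl",
        mode="a",                    # open the existing workbook instead of recreating it
        if_sheet_exists="replace",   # overwrite only the named sheet
    ) as writer:
        # index=False avoids an extra unnamed column when the sheet is read back.
        frame.to_excel(writer, sheet_name=sheet_name, index=False)


# Example call (commented out because it needs the real file):
# replace_sheet('excel_data.xlsx', newdata, 'Sheet2')
```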