## What is pandas?
In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

### How to install
If Python and pip are already installed on your system, installing pandas is straightforward. Install it with this command:
- pip install pandas

After installing pandas, import it with the following code:


In [39]:
import pandas as pd

In [40]:
print(pd.__version__)

0.25.1


### Core components of pandas: Series and DataFrames
The two primary components of pandas are the Series and the DataFrame. A Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar: many operations that you can do with one you can also do with the other, such as filling in null values and calculating the mean.
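
As a rough illustration of the relationship described above, the sketch below builds a Series, places it in a DataFrame, and runs the same two operations (mean and null filling) on both. The variable names and the sample values are made up for this example.

```python
import pandas as pd

# A Series is a single labelled column of values (one value is missing here).
apples = pd.Series([3, 2, None, 1], name="apples")

# A DataFrame is a table whose columns are Series that share one index.
fruit = pd.DataFrame({"apples": apples, "oranges": [0, 3, 7, 2]})

# Selecting one column of a DataFrame gives back a Series.
column = fruit["apples"]
print(type(column).__name__)

# The same operations are available on both objects.
series_mean = apples.mean()        # a single number; nulls are skipped
frame_means = fruit.mean()         # one mean per column
filled_series = apples.fillna(0)   # nulls replaced in the Series
filled_frame = fruit.fillna(0)     # nulls replaced in every column

print(series_mean)
print(frame_means)
```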

### Creating DataFrames from scratch
Creating DataFrames directly in Python is good to know and quite useful when testing new methods and functions.
There are many ways to create a DataFrame from scratch, but a great option is to use a simple dict, where each key becomes a column name and each list becomes that column's values.
Note: <b>The index of this DataFrame defaults to the numbers 0 to 3 on creation, but we can also supply our own when we initialize the DataFrame.</b>

In [41]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
df=pd.DataFrame(data)
df

Unnamed: 0,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


In [42]:
df = pd.DataFrame(data, index=['sudhan', 'pitamber', 'manish', 'kiran'])
df

Unnamed: 0,apples,oranges
sudhan,3,0
pitamber,2,3
manish,0,7
kiran,1,2


In [43]:
df.loc[['sudhan']]  # a list of labels returns a DataFrame with the matching rows


Unnamed: 0,apples,oranges
sudhan,3,0


## Reading in data

In [44]:
csv_data=pd.read_csv('data.csv')
csv_data

Unnamed: 0.1,Unnamed: 0,apples,oranges
0,sudhan,3,0
1,pitamber,2,3
2,manish,0,7
3,kiran,1,2


### Converting back to a CSV
After extensive work on cleaning your data, you’re ready to save it as a file of your choice. Just as it does for reading data, pandas provides intuitive commands for saving it.

In [45]:
csv_data.to_csv('clean_data.csv')
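
One detail worth knowing when saving: by default, to_csv() also writes the index as an unnamed first column, which is why a file read back with read_csv() can show an extra `Unnamed: 0` column. The sketch below shows two common ways to handle this. It writes to an in-memory buffer (io.StringIO) rather than a real file so that it runs anywhere, and the sample data is illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["sudhan", "pitamber", "manish", "kiran"],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reread_default = pd.read_csv(with_index)
print(reread_default.columns.tolist())

# Option 1: tell read_csv that the first column holds the index.
with_index.seek(0)
reread_indexed = pd.read_csv(with_index, index_col=0)
print(reread_indexed.index.tolist())

# Option 2: do not write the index at all (it is lost on the way back).
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reread_plain = pd.read_csv(without_index)
print(reread_plain.columns.tolist())
```

Option 1 suits data whose index carries meaning (names, dates); option 2 suits a plain 0, 1, 2, ... index that can be regenerated.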

### DataFrame operations
- Data Viewing
- Getting information about Data
- Handling duplicates
- Column cleanup

### Data Viewing
The first thing to do when opening a new dataset is to print out a few rows to keep as a visual reference. We accomplish this with .head().
- .head() outputs the first five rows of your DataFrame by default, but we can also pass a number: df.head(10) would output the top ten rows.
- To see the last five rows, use .tail(). It also accepts a number.

In [108]:
dataframe=pd.read_csv('titanic_train.csv')
dataframe.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [109]:
dataframe.tail(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


### Getting info about your data
- .info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.
- Another fast and useful attribute is .shape, which is a tuple of (rows, columns).

In [110]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [111]:
dataframe.shape

(891, 12)

### Handling duplicates
- The drop_duplicates() method returns a copy of your DataFrame with duplicate rows removed.
- Assigning the result back to the same variable every time is a little verbose. For this reason, pandas has the inplace keyword argument on many of its methods. Using inplace=True modifies the DataFrame object in place instead of returning a copy.
- Another important argument for drop_duplicates() is keep, which has three possible options:
    - first: (default) Drop duplicates except for the first occurrence.
    - last: Drop duplicates except for the last occurrence.
    - False: Drop every row that has a duplicate, keeping none of them.
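
To make the three keep options concrete, here is a small sketch on a made-up four-row table in which rows 0 and 2 are identical. The table and variable names are illustrative only.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["sudhan", "kiran", "sudhan", "manish"],
    "age": [25, 30, 25, 41],
})

# keep='first' (the default): the later copy, row 2, is dropped.
first = people.drop_duplicates()
print(first.index.tolist())

# keep='last': the earlier copy, row 0, is dropped.
last = people.drop_duplicates(keep="last")
print(last.index.tolist())

# keep=False: both copies are dropped.
neither = people.drop_duplicates(keep=False)
print(neither.index.tolist())

# With inplace=True the method changes the object and returns None.
working_copy = people.copy()
result = working_copy.drop_duplicates(inplace=True)
print(result, len(working_copy))
```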

In [112]:
dataframe = pd.concat([dataframe, dataframe])  # stack the table on itself; DataFrame.append() was removed in pandas 2.0
print(dataframe.shape)
dataframe=dataframe.drop_duplicates()
print(dataframe.shape)
dataframe.head()

(1782, 12)
(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [113]:
dataframe.drop_duplicates(inplace=True,keep=False)

In [114]:
dataframe.shape

(891, 12)

### Column cleanup
Datasets often have verbose column names with symbols, mixed upper- and lowercase words, spaces, and typos. To make selecting data by column name easier, we can spend a little time cleaning up the names.
- We can use the rename() method to rename some or all columns via a dict that maps each old name to its new name.

In [115]:
dataframe.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [116]:
dataframe.rename(columns={
        'PassengerId': 'PID'
    }, inplace=True)
dataframe.head(2)

Unnamed: 0,PID,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [117]:
dataframe.columns=[col.lower() for col in dataframe ]
dataframe.head(1)

Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


### How to work with missing values
When exploring data, you’ll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's None or NumPy's np.nan, which are handled differently in some situations.
- isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
- Chaining sum() onto the result returns the number of null values in each column.

In [118]:
dataframe.isnull().sum()

pid           0
survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

### Removing null values
- dropna() deletes any row with at least one null value. It returns a new DataFrame without altering the original one, unless you specify inplace=True.
- Other than dropping rows, you can also drop columns that contain null values by setting axis=1.
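
The difference between dropping rows and dropping columns is easier to see on a tiny table. The following sketch uses invented values; the subset argument shown at the end is an additional dropna() option that limits the null check to the listed columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan],
    "fare": [7.25, 71.28, 7.92],
    "cabin": [np.nan, "C85", np.nan],
})

# Default (axis=0): drop every row that has at least one null.
rows_dropped = df.dropna()
print(rows_dropped.shape)

# axis=1: drop every column that has at least one null.
columns_dropped = df.dropna(axis=1)
print(columns_dropped.columns.tolist())

# subset: only look at the 'age' column when deciding which rows to drop.
age_known = df.dropna(subset=["age"])
print(age_known.index.tolist())

# None of the calls above changed the original DataFrame.
print(df.shape)
```

On a dataset like the Titanic one, where a single column holds most of the nulls, dropping by subset or dropping that one column usually keeps far more data than a plain dropna().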

In [119]:
dataframe.dropna()


Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [120]:
dataframe.isnull().sum()

pid           0
survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

### Imputation
Imputation is a conventional feature engineering technique used to keep valuable data that have null values. There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the mean or the median of that column.

In [121]:
data_age=dataframe['age']
data_age.isnull().sum()

177

In [124]:
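age_mean = data_age.mean()
dataframe['age'] = dataframe['age'].fillna(age_mean)  # assign the filled column back instead of filling a selected column in place
dataframe['age'].isnull().sum()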

0
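
The imputation paragraph mentions the median as an alternative to the mean. A minimal sketch of both, on a made-up Series with one missing value and one large outlier, shows how the choice changes the filled-in number:

```python
import numpy as np
import pandas as pd

# Illustrative ages: one missing value and one outlier (80).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, 80.0], name="age")

# mean() and median() both skip nulls by default.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print(mean_filled[2])
print(median_filled[2])

# fillna() returned new Series; the original still has its null.
print(ages.isnull().sum())
```

The median is less affected by extreme values, so it is often the safer choice for skewed columns.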

### Summary of data
- Using describe() on an entire DataFrame, we can get a summary of the distribution of continuous variables.
- value_counts() tells us the frequency of each value in a column.
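
As a small sketch of how these two methods behave on different column types, the example below uses a made-up table with one text column and one numeric column. The normalize=True argument is an optional value_counts() setting that returns proportions instead of counts.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "male"],
    "age": [22.0, 38.0, 35.0, 54.0],
})

# On a DataFrame, describe() summarises the numeric columns by default.
numeric_summary = df.describe()
print(numeric_summary.columns.tolist())

# On a text column, describe() reports count, unique, top and freq instead.
text_summary = df["sex"].describe()
print(text_summary)

# value_counts() gives the frequency of each distinct value.
counts = df["sex"].value_counts()
shares = df["sex"].value_counts(normalize=True)
print(counts)
print(shares)
```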

In [129]:
dataframe.describe()

Unnamed: 0,pid,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,13.002015,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,29.699118,0.0,0.0,14.4542
75%,668.5,1.0,3.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [131]:
dataframe['age'].describe()  # display the information about the age column

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: age, dtype: float64

In [134]:
dataframe['age'].value_counts().head(10)  # the 10 most frequent ages

29.699118    177
24.000000     30
22.000000     27
18.000000     26
28.000000     25
30.000000     25
19.000000     25
21.000000     24
25.000000     23
36.000000     22
Name: age, dtype: int64

### Relationships between continuous variables
- The corr() method computes the pairwise correlation between the continuous variables. Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.

In [135]:
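dataframe.select_dtypes(include='number').corr()  # keep numeric columns only; newer pandas versions no longer skip text columns automatically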

Unnamed: 0,pid,survived,pclass,age,sibsp,parch,fare
pid,1.0,-0.005007,-0.035144,0.033207,-0.057527,-0.001652,0.012658
survived,-0.005007,1.0,-0.338481,-0.069809,-0.035322,0.081629,0.257307
pclass,-0.035144,-0.338481,1.0,-0.331339,0.083081,0.018443,-0.5495
age,0.033207,-0.069809,-0.331339,1.0,-0.232625,-0.179191,0.091566
sibsp,-0.057527,-0.035322,0.083081,-0.232625,1.0,0.414838,0.159651
parch,-0.001652,0.081629,0.018443,-0.179191,0.414838,1.0,0.216225
fare,0.012658,0.257307,-0.5495,0.091566,0.159651,0.216225,1.0


### DataFrame slicing, selecting, extracting
- We can select columns by putting their names inside square brackets.
- For rows, we have two options:
    - loc: locates by label
    - iloc: locates by integer position
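
The difference between loc and iloc is clearest on a DataFrame whose index is not just 0, 1, 2, .... The sketch below reuses the small fruit table from earlier in this tutorial; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["sudhan", "pitamber", "manish", "kiran"],
)

# loc uses index labels, iloc uses integer positions.
by_label = df.loc["pitamber"]
by_position = df.iloc[1]
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc["pitamber":"manish"]
position_slice = df.iloc[1:2]
print(len(label_slice), len(position_slice))

# Both accept a row selector and a column selector together.
cell = df.loc["manish", "oranges"]
same_cell = df.iloc[2, 1]
print(cell, same_cell)
```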

In [138]:
subset = dataframe[['pid', 'age']]
subset.head()

Unnamed: 0,pid,age
0,1,22.0
1,2,38.0
2,3,26.0
3,4,35.0
4,5,35.0


In [142]:
subset.loc[0]

pid     1.0
age    22.0
Name: 0, dtype: float64

#### Conditional selections


In [149]:
male=dataframe[dataframe['sex']=='male']
male.head(2)

Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [153]:
age = dataframe[dataframe['age'] > 30]  # extract rows where age is above 30
age.head()

Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S


In [154]:
dataframe[(dataframe['sex'] == 'male') & (dataframe['age'] >30)].head()

Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0,D56,S


In [156]:
dataframe[dataframe['name'].isin(['McCarthy, Mr. Timothy J', 'Fynney, Mr. Joseph J'])]


Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
20,21,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,239865,26.0,,S


### Applying functions
It is possible to iterate over a DataFrame or Series as you would with a list, but doing so is very slow, especially on large datasets. A more convenient alternative is to apply() a function to the dataset, which calls the function once for each value and collects the results.


In [160]:
def age_category(x):
    # Bucket an age into one of three labels
    if x < 15:
        return "child"
    elif x < 50:
        return "young"
    else:
        return 'old'

In [161]:
dataframe['agecategory']=dataframe['age'].apply(age_category)

In [163]:
dataframe.head()

Unnamed: 0,pid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,agecategory
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,young
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,young
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q,young
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,old
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,child
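
For simple binning like this, pandas also offers pd.cut(), which assigns labels to numeric ranges without calling a Python function once per value. The sketch below compares the two approaches on a few made-up ages. The function name label_age and the cut-offs of 15 and 50 are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 15.0, 22.0, 49.9, 54.0], name="age")

def label_age(x):
    # Hypothetical helper: under 15 is a child, under 50 is young, otherwise old.
    if x < 15:
        return "child"
    elif x < 50:
        return "young"
    return "old"

# Approach 1: apply() calls the function once for each value.
applied = ages.apply(label_age)

# Approach 2: pd.cut() bins the whole column at once.
# right=False makes each bin include its left edge: [-inf, 15), [15, 50), [50, inf).
binned = pd.cut(
    ages,
    bins=[-np.inf, 15, 50, np.inf],
    labels=["child", "young", "old"],
    right=False,
)

print(applied.tolist())
print(binned.astype(str).tolist())
```

apply() remains the more general tool, since the function can contain any logic, while pd.cut() only covers the case of mapping numeric ranges to labels.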
