<p align="center">
<img src="https://www.senenews.com/wp-content/uploads/2018/02/aims-3.png">
</p>

What is pandas?


*   Python library used for working with data sets.
*   It has functions for analyzing, cleaning, exploring, and manipulating data.

Key Features of Pandas:


*   Fast and efficient DataFrame object with default and customized indexing.
*   Tools for loading data into in-memory data objects from different file formats.
*   Data alignment and integrated handling of missing data.
*   Reshaping and pivoting of data sets.
*   Label-based slicing, indexing and subsetting of large data sets.
*   Columns from a data structure can be deleted or inserted.
*   Group by data for aggregation and transformations.
*   High performance merging and joining of data.









To use pandas, you first need to install it on your local machine: `!pip install pandas`

In [None]:
# Installation of pandas
!pip install pandas



Alias (`as`): in Python, an alias is an alternate name for referring to the same thing. By convention, pandas is imported under the alias `pd`.

In [None]:
# Import pandas as pd
import pandas as pd

Some projects require a specific version of a library. To check which pandas version is installed, use `pd.__version__`:

In [None]:
pd.__version__

'1.3.5'

# Introducing Pandas Objects

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these two fundamental Pandas data structures: the ``Series`` and the ``DataFrame``.

We will start our code sessions with the standard NumPy and Pandas imports:

In [None]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [None]:
# array of the values
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.

In [None]:
data.index

RangeIndex(start=0, stop=4, step=1)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
data[1]

0.5

In [None]:
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.

### ``Series`` as generalized NumPy array

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

And the item access works as expected:

In [None]:
data['b']

0.5

We can even use non-contiguous or non-sequential indices:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [None]:
data[5]

0.5
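
With an integer index like this one, `data[5]` looks up the *label* 5, not the sixth position. The sketch below makes the difference explicit with the `loc` (label) and `iloc` (position) accessors, which are covered later in this notebook; the variable names are chosen here for illustration only.

```python
import pandas as pd

# Same values as above, with a non-sequential integer index.
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[5]      # the value stored under label 5
by_position = s.iloc[1]  # the second element, whatever its label
third = s.iloc[2]        # the third element (its label happens to be 3)

print(by_label, by_position, third)
```

Being explicit with `loc` and `iloc` avoids ambiguity whenever the labels are themselves integers.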

### Series as specialized dictionary

The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:

In [None]:
population_dict = {'Senegal': 38332521,
                   'Cameroun': 26448193,
                   'Mali': 19651127,
                   'Soudan': 19552860,
                   'Ghana': 12882135}
population = pd.Series(population_dict)
population

Senegal     38332521
Cameroun    26448193
Mali        19651127
Soudan      19552860
Ghana       12882135
dtype: int64

By default, a ``Series`` will be created where the index is drawn from the dictionary keys, in the order in which they were inserted (older pandas versions sorted the keys instead).
From here, typical dictionary-style item access can be performed:

In [None]:
population['Senegal']

38332521

Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:

In [None]:
population['Senegal':'Soudan']

Senegal     38332521
Cameroun    26448193
Mali        19651127
Soudan      19552860
dtype: int64
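
One detail worth checking for yourself: a slice by label includes its end point, whereas a slice by position does not. The following is a minimal sketch reusing the population figures from above; `label_slice` and `position_slice` are names introduced here for illustration.

```python
import pandas as pd

population = pd.Series({'Senegal': 38332521, 'Cameroun': 26448193,
                        'Mali': 19651127, 'Soudan': 19552860,
                        'Ghana': 12882135})

label_slice = population['Senegal':'Mali']  # end label 'Mali' is included
position_slice = population.iloc[0:2]       # end position 2 is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```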

### Constructing Series objects

We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:

```python
>>> pd.Series(data, index=index)
```

where ``index`` is an optional argument, and ``data`` can be one of many entities.

For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:

In [None]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

``data`` can be a scalar, which is repeated to fill the specified index:

In [None]:
# broadcasting
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys, in insertion order:

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In each case, the index can be explicitly set if a different result is preferred:

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
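
The reverse situation is also worth knowing: if the explicit index asks for a key that the dictionary does not contain, the entry is filled with `NaN` rather than raising an error. A small sketch (the key `9` is a made-up example):

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# 9 is not a key of the dictionary, so its value is missing in the result.
s = pd.Series(letters, index=[3, 2, 9])

print(s)
print(s.isna().tolist())
```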

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.

### DataFrame as a generalized NumPy array
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five countries discussed in the previous section:

In [None]:
area_dict = {'Senegal': 423967, 'Cameroun': 695662, 'Mali': 141297,
             'Sudan': 170312, 'Ghana': 149995}
area = pd.Series(area_dict)
area

Senegal     423967
Cameroun    695662
Mali        141297
Sudan       170312
Ghana       149995
dtype: int64

Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

Unnamed: 0,population,area
Cameroun,26448193.0,695662.0
Ghana,12882135.0,149995.0
Mali,19651127.0,141297.0
Senegal,38332521.0,423967.0
Soudan,19552860.0,
Sudan,,170312.0

Notice that the ``population`` Series uses the label ``'Soudan'`` while ``area`` uses ``'Sudan'``. Because the two labels differ, Pandas treats them as separate rows and fills the entries that have no counterpart with ``NaN``; this is also why both columns are displayed as floating-point numbers.


Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:

In [None]:
states.index

Index(['Cameroun', 'Ghana', 'Mali', 'Senegal', 'Soudan', 'Sudan'], dtype='object')

Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
states.columns

Index(['population', 'area'], dtype='object')

Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area'`` attribute returns the ``Series`` object containing the areas we saw earlier:

In [None]:
states['area']

Cameroun    695662.0
Ghana       149995.0
Mali        141297.0
Senegal     423967.0
Soudan           NaN
Sudan       170312.0
Name: area, dtype: float64

### Constructing DataFrame objects

A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.

#### From a single Series object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [None]:
pd.DataFrame(population,columns=['population'])

Unnamed: 0,population
Senegal,38332521
Cameroun,26448193
Mali,19651127
Soudan,19552860
Ghana,12882135


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [None]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


#### From a dictionary of Series objects

As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:

In [None]:
pd.DataFrame({'population': population,'area': area})

Unnamed: 0,population,area
Cameroun,26448193.0,695662.0
Ghana,12882135.0,149995.0
Mali,19651127.0,141297.0
Senegal,38332521.0,423967.0
Soudan,19552860.0,
Sudan,,170312.0
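
The `NaN` entries above arise because the two Series spell one label differently (`'Soudan'` in `population`, `'Sudan'` in `area`). One possible way to align them, sketched here under the assumption that both labels refer to the same country, is to rename the label in one Series before building the `DataFrame`; `area_fixed` is a name introduced for this example.

```python
import pandas as pd

population = pd.Series({'Senegal': 38332521, 'Cameroun': 26448193,
                        'Mali': 19651127, 'Soudan': 19552860,
                        'Ghana': 12882135})
area = pd.Series({'Senegal': 423967, 'Cameroun': 695662, 'Mali': 141297,
                  'Sudan': 170312, 'Ghana': 149995})

# Rename the single mismatched label so both Series share one spelling.
area_fixed = area.rename({'Sudan': 'Soudan'})

states = pd.DataFrame({'population': population, 'area': area_fixed})
print(states)
```

With matching labels, every row has both a population and an area, so no missing values appear.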


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.504173,0.315757
b,0.296157,0.943268
c,0.215002,0.236278


### Reading data files

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

```
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
```

So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
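
To try this without a file on disk, the small table above can be parsed from an in-memory string. This is only a sketch: it uses `io.StringIO` to stand in for a file, and the trailing commas of the sample have been dropped so that each row has exactly three fields.

```python
import io
import pandas as pd

# The sample table from above, without trailing commas.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts any file-like object, not only a file path.
products = pd.read_csv(io.StringIO(csv_text))

print(products.shape)
print(products.columns.tolist())
```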

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the `pd.read_csv()` function to read the data into a DataFrame. This goes thusly: `data = pd.read_csv("filename.csv")`

In [None]:
import pandas as pd 

In [None]:
data = pd.read_csv("/content/sample_data/california_housing_train.csv")

We can use the `shape` attribute to check how large the resulting DataFrame is:

In [None]:
data.shape

(17000, 9)

So our new DataFrame has 17,000 records split across 9 different columns.

We can examine the contents of the resultant DataFrame using the `head()` command, which grabs the first five rows:

In [None]:
data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


How can we save the file on our local machine?

***Exercise*** 😊

In [None]:
data.to_csv('Stevepandas',index=False)
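
To see what `index=False` changes, the sketch below writes a tiny DataFrame twice, once without and once with the index, and reads both files back. The file names and the temporary directory are assumptions made for this example only.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 2, 4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names inside a throwaway directory.
    path = os.path.join(tmp_dir, 'example.csv')
    df.to_csv(path, index=False)      # write the columns only
    restored = pd.read_csv(path)

    with_index_path = os.path.join(tmp_dir, 'with_index.csv')
    df.to_csv(with_index_path)        # the index is written by default
    with_index = pd.read_csv(with_index_path)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```

When the index is written, it comes back as an extra unnamed column on reading, which is usually not wanted for a default integer index.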

# Indexing, Selecting & Assigning

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you'll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

## Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which helps make it easy to start with.

Consider this DataFrame:

In [None]:
import pandas as pd 

In [None]:
data = pd.read_csv("/content/sample_data/california_housing_train.csv")


In Python, we can access the property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way. 

Hence to access the `median_house_value` property of `data` we can use:

In [None]:
data.median_house_value

0         66900.0
1         80100.0
2         85700.0
3         73400.0
4         65500.0
           ...   
16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, Length: 17000, dtype: float64

If we have a Python dictionary, we can access its values using the indexing (`[]`) operator. We can do the same with columns in a DataFrame:

In [None]:
data['median_house_value']  # returns a Series

0         66900.0
1         80100.0
2         85700.0
3         73400.0
4         65500.0
           ...   
16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, Length: 17000, dtype: float64

In [None]:
data[['median_house_value']]# dataframe

Unnamed: 0,median_house_value
0,66900.0
1,80100.0
2,85700.0
3,73400.0
4,65500.0
...,...
16995,111400.0
16996,79000.0
16997,103600.0
16998,85800.0


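Attribute access and the indexing operator are the two ways of selecting a specific Series out of a DataFrame. Passing a list of column names, as in the last cell, returns a DataFrame instead.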

Doesn't a pandas Series look kind of like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator `[]` once more:

In [None]:
data['median_house_value'][0]

66900.0

## Indexing in pandas

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, `loc` and `iloc`. For more advanced operations, these are the ones you're supposed to be using.

### Index-based selection

Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [None]:
data.iloc[0]

longitude              -114.3100
latitude                 34.1900
housing_median_age       15.0000
total_rooms            5612.0000
total_bedrooms         1283.0000
population             1015.0000
households              472.0000
median_income             1.4936
median_house_value    66900.0000
Name: 0, dtype: float64

Both `loc` and `iloc` are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with `iloc`, we can do the following:

In [None]:
data.iloc[:,0]

0       -114.31
1       -114.47
2       -114.56
3       -114.57
4       -114.57
          ...  
16995   -124.26
16996   -124.27
16997   -124.30
16998   -124.30
16999   -124.35
Name: longitude, Length: 17000, dtype: float64

On its own, the `:` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the `longitude` column from just the first, second, and third row, we would do:

In [None]:
data.iloc[:3,0]


0   -114.31
1   -114.47
2   -114.56
Name: longitude, dtype: float64

Or, to select just the second and third entries, we would do:

In [None]:
data.iloc[1:3, 0]

1   -114.47
2   -114.56
Name: longitude, dtype: float64

It's also possible to pass a list:

In [None]:
data.iloc[[0, 1, 2], 0]

0   -114.31
1   -114.47
2   -114.56
Name: longitude, dtype: float64

Finally, it's worth knowing that negative numbers can be used in selection. This will start counting forwards from the _end_ of the values. So for example here are the last five elements of the dataset.

In [None]:
data.iloc[-5:]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.3,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.3,41.8,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0
16999,-124.35,40.54,52.0,1820.0,300.0,806.0,270.0,3.0147,94600.0


### Label-based selection

The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

For example, to get the `longitude` of the first entry in `data`, we would now do the following:

In [None]:
data.loc[0 , "longitude"]

-114.31

In [None]:
data.loc[:,["latitude","longitude"]]

Unnamed: 0,latitude,longitude
0,34.19,-114.31
1,34.40,-114.47
2,33.69,-114.56
3,33.64,-114.57
4,33.57,-114.57
...,...,...
16995,40.58,-124.26
16996,40.69,-124.27
16997,41.84,-124.30
16998,41.80,-124.30
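
The difference between the two paradigms is easiest to see when the index labels are not simply 0, 1, 2, and so on. The sketch below uses a few rows of the housing data with made-up index labels (10, 20, 30, 40 are assumptions for illustration): `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

df = pd.DataFrame({'longitude': [-114.31, -114.47, -114.56, -114.57],
                   'latitude': [34.19, 34.40, 33.69, 33.64]},
                  index=[10, 20, 30, 40])  # hypothetical labels

by_position = df.iloc[0:2]        # positions 0 and 1 only
by_label = df.loc[10:30]          # labels 10, 20 and 30 (end included)
single = df.loc[20, 'latitude']   # one value: row label, then column label

print(list(by_position.index))
print(list(by_label.index))
print(single)
```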


## Conditional selection

So far we've been indexing various strides of data, using structural properties of the DataFrame itself. To do *interesting* things with the data, however, we often need to ask questions based on conditions. 

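For example, suppose that we're interested specifically in records located at latitude 34.40 that also have a large number of rooms.

Let's first look at a few rows with `loc` (note that, unlike `iloc`, a `loc` slice includes its end label), and then check whether each record's latitude is 34.40 or not: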

In [None]:
data.loc[1:5,['longitude','latitude']]

Unnamed: 0,longitude,latitude
1,-114.47,34.4
2,-114.56,33.69
3,-114.57,33.64
4,-114.57,33.57
5,-114.58,33.63


In [None]:
data.latitude==34.40

0        False
1         True
2        False
3        False
4        False
         ...  
16995    False
16996    False
16997    False
16998    False
16999    False
Name: latitude, Length: 17000, dtype: bool

This operation produced a Series of `True`/`False` booleans based on the `latitude` of each record.  This result can then be used inside of `loc` to select the relevant data:

In [None]:
data.loc[data.latitude == 34.40]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
388,-116.94,34.4,20.0,6541.0,1401.0,2631.0,980.0,2.1339,78500.0
1816,-117.27,34.4,8.0,6042.0,979.0,3031.0,991.0,3.3438,124400.0
8319,-118.46,34.4,12.0,25957.0,4798.0,10475.0,4490.0,4.542,195300.0
8579,-118.52,34.4,5.0,7748.0,1557.0,4768.0,1393.0,5.305,311200.0
8957,-118.9,34.4,16.0,2614.0,575.0,1163.0,524.0,1.5781,134400.0
8967,-118.91,34.4,30.0,2861.0,613.0,2065.0,586.0,3.2024,176100.0
8977,-118.92,34.4,23.0,1290.0,283.0,1060.0,279.0,3.3152,198000.0
8986,-118.93,34.4,17.0,3275.0,599.0,2422.0,637.0,3.7092,190500.0
9066,-118.98,34.4,34.0,1328.0,244.0,795.0,227.0,4.4219,338100.0


Now suppose we also want only the records with at least 7,650 total rooms. We can use the ampersand (`&`) to bring the two questions together:

In [None]:
data.loc[(data.latitude == 34.40) & (data.total_rooms >= 7650.0)]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
8319,-118.46,34.4,12.0,25957.0,4798.0,10475.0,4490.0,4.542,195300.0
8579,-118.52,34.4,5.0,7748.0,1557.0,4768.0,1393.0,5.305,311200.0

To select the records that satisfy at least one of the two conditions, we can use the pipe (`|`) instead:


In [None]:
data.loc[(data.latitude == 34.40) | (data.total_rooms >= 7650.0)]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
133,-116.06,34.15,15.0,10377.0,2331.0,4507.0,1807.0,2.2466,66800.0
135,-116.09,34.15,13.0,9444.0,1997.0,4166.0,1482.0,2.6111,65600.0
168,-116.24,33.71,10.0,9033.0,2224.0,5525.0,1845.0,2.7598,95000.0
175,-116.29,33.74,6.0,12991.0,2555.0,4571.0,1926.0,4.7195,199300.0
...,...,...,...,...,...,...,...,...,...
16459,-122.61,38.42,13.0,7731.0,1360.0,2543.0,1249.0,4.6957,259800.0
16467,-122.61,37.99,40.0,7737.0,1488.0,3108.0,1349.0,4.4375,289600.0
16474,-122.62,38.40,10.0,9772.0,1308.0,3741.0,1242.0,6.5261,324700.0
16490,-122.63,38.26,7.0,7808.0,1390.0,3551.0,1392.0,4.6069,202300.0
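
The conditions above can be stored in variables and combined freely. The following sketch uses a small hand-made frame (the values are chosen for illustration) and also shows negation with `~` and membership tests with `isin()`. The parentheses around each comparison in the cells above matter, because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

df = pd.DataFrame({
    'latitude': [34.19, 34.40, 33.69, 34.40, 34.40],
    'total_rooms': [5612.0, 7650.0, 720.0, 6541.0, 25957.0],
})

at_lat = df.latitude == 34.40     # boolean Series, one entry per row
big = df.total_rooms >= 7650.0

both = df.loc[at_lat & big]        # rows meeting both conditions
either = df.loc[at_lat | big]      # rows meeting at least one
neither = df.loc[~(at_lat | big)]  # rows meeting none
in_set = df.loc[df.latitude.isin([34.19, 33.69])]

print(list(both.index), list(either.index))
```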


# Summary Functions and Maps 

In [None]:
import pandas as pd

data = pd.read_csv("/content/sample_data/california_housing_train.csv")
data

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


## Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [None]:
data.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.562108,35.625225,28.589353,2643.664412,539.410824,1429.573941,501.221941,3.883578,207300.912353
std,2.005166,2.13734,12.586937,2179.947071,421.499452,1147.852959,384.520841,1.908157,115983.764387
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.79,33.93,18.0,1462.0,297.0,790.0,282.0,2.566375,119400.0
50%,-118.49,34.25,29.0,2127.0,434.0,1167.0,409.0,3.5446,180400.0
75%,-118.0,37.72,37.0,3151.25,648.25,1721.0,605.25,4.767,265000.0
max,-114.31,41.95,52.0,37937.0,6445.0,35682.0,6082.0,15.0001,500001.0


This method generates a high-level summary of each column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for numerical data, which is all this dataset contains. The same method also works on a single Series:

In [None]:
data.latitude.describe()  # describe a single pandas Series

count    17000.000000
mean        35.625225
std          2.137340
min         32.540000
25%         33.930000
50%         34.250000
75%         37.720000
max         41.950000
Name: latitude, dtype: float64
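
Since the housing data has no text columns, here is a small sketch of what `describe()` reports for string data instead. The `countries` Series is made up for this example.

```python
import pandas as pd

countries = pd.Series(['Mali', 'Ghana', 'Mali', 'Senegal', 'Mali'])

# For strings, describe() reports counts and the most frequent value
# rather than a mean or quartiles.
summary = countries.describe()
print(summary)
```

The result has the entries `count`, `unique`, `top` (the most frequent value) and `freq` (how often that value occurs).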

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. 

For example, to see the mean of the `latitude` column, we can use the `mean()` function; `std()` gives the standard deviation in the same way:

In [None]:
data.latitude.mean()

35.6252247058827

In [None]:
data.longitude.std()

2.005166408426173

To see a list of unique values we can use the `unique()` function:

In [None]:
data.latitude.unique()

array([34.19, 34.4 , 33.69, 33.64, 33.57, 33.63, 33.61, 34.83, 33.62,
       33.6 , 34.84, 32.76, 34.89, 32.79, 32.74, 33.92, 33.49, 33.43,
       34.55, 33.82, 33.54, 32.82, 32.81, 32.86, 32.7 , 32.99, 33.19,
       32.8 , 32.68, 32.87, 32.69, 32.67, 32.75, 33.24, 33.12, 34.22,
       33.13, 32.98, 32.97, 32.77, 32.73, 34.91, 32.78, 32.96, 32.85,
       32.84, 32.83, 33.88, 33.2 , 33.04, 33.36, 33.35, 33.09, 33.26,
       34.2 , 32.93, 33.34, 35.55, 33.38, 33.28, 33.3 , 33.32, 33.4 ,
       33.51, 33.41, 34.18, 34.12, 33.33, 34.15, 33.86, 33.53, 34.14,
       33.68, 33.67, 33.66, 33.7 , 32.64, 33.75, 33.72, 33.71, 36.  ,
       34.21, 33.74, 33.73, 33.81, 33.65, 33.07, 34.13, 34.1 , 34.09,
       33.78, 33.79, 33.76, 34.16, 33.93, 33.77, 33.8 , 32.65, 34.07,
       33.94, 33.84, 33.98, 33.95, 34.85, 34.45, 33.96, 33.89, 33.97,
       33.85, 32.9 , 34.06, 33.83, 33.05, 35.43, 34.  , 33.06, 34.23,
       33.16, 33.56, 34.29, 33.46, 33.  , 33.99, 32.61, 34.24, 34.25,
       33.01, 32.92,

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [None]:
data.latitude.value_counts()#frequency of unique values

34.06    205
34.08    200
34.05    196
34.07    194
34.04    188
        ... 
39.96      1
41.50      1
39.38      1
35.91      1
41.84      1
Name: latitude, Length: 840, dtype: int64
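
On a small Series the relationship between these functions is easy to verify by hand. The sketch below uses a few made-up latitude values; `nunique()` and the `normalize=True` option are shown as additions to what the text above covers.

```python
import pandas as pd

lat = pd.Series([34.06, 34.08, 34.06, 34.05, 34.06, 34.08])

distinct = lat.unique()                       # distinct values, in order of appearance
n_distinct = lat.nunique()                    # how many distinct values there are
counts = lat.value_counts()                   # occurrences, most frequent first
shares = lat.value_counts(normalize=True)     # same, as fractions of the total

print(distinct, n_distinct)
print(counts)
```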

## Maps

A **map** is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often have a need for creating new representations from existing data, or for transforming data from the format it is in now to the format that we want it to be in later. Maps are what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often. 

[`map()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) is the first, and slightly simpler one. For example, suppose that we wanted to remean the latitudes to 0, that is, subtract the mean latitude from every value. We can do this as follows:

In [None]:
data.latitude

0        34.19
1        34.40
2        33.69
3        33.64
4        33.57
         ...  
16995    40.58
16996    40.69
16997    41.84
16998    41.80
16999    40.54
Name: latitude, Length: 17000, dtype: float64

In [None]:
latitude_mean = data.latitude.mean()
data.latitude.map(lambda p: p - latitude_mean)  # map() is applied to each value of a pandas Series

0       -1.435225
1       -1.225225
2       -1.935225
3       -1.985225
4       -2.055225
           ...   
16995    4.954775
16996    5.064775
16997    6.214775
16998    6.174775
16999    4.914775
Name: latitude, Length: 17000, dtype: float64

`apply()` is the equivalent method if we want to transform a whole DataFrame by calling a custom function on each row:

In [None]:
def remean_latitude(row):
    row = row.latitude - latitude_mean
    return row

data.apply(remean_latitude, axis='columns')# for whole data

0       -1.435225
1       -1.225225
2       -1.935225
3       -1.985225
4       -2.055225
           ...   
16995    4.954775
16996    5.064775
16997    6.214775
16998    6.174775
16999    4.914775
Length: 17000, dtype: float64
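
For a simple arithmetic transformation like this one, plain operators on the Series give the same result as `map()` and `apply()`, and they are usually much faster on large data. A minimal sketch on a few rows (the small frame is a stand-in for the housing data):

```python
import pandas as pd

df = pd.DataFrame({'latitude': [34.19, 34.40, 33.69, 33.64],
                   'longitude': [-114.31, -114.47, -114.56, -114.57]})
latitude_mean = df.latitude.mean()

via_map = df.latitude.map(lambda p: p - latitude_mean)
via_apply = df.apply(lambda row: row.latitude - latitude_mean, axis='columns')
via_operator = df.latitude - latitude_mean   # vectorized subtraction

print(via_operator)
```

Note that none of these calls modifies `df`; each returns a new Series.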

In [None]:
def remean_latitude(row):
    # Caution: subtracting from the whole row shifts every column, not just latitude.
    row = row - latitude_mean
    return row

data.apply(remean_latitude, axis='columns')

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-149.935225,-1.435225,-20.625225,5576.374775,1247.374775,979.374775,436.374775,-34.131625,66864.374775
1,-150.095225,-1.225225,-16.625225,7614.374775,1865.374775,1093.374775,427.374775,-33.805225,80064.374775
2,-150.185225,-1.935225,-18.625225,684.374775,138.374775,297.374775,81.374775,-33.974325,85664.374775
3,-150.195225,-1.985225,-21.625225,1465.374775,301.374775,479.374775,190.374775,-32.433525,73364.374775
4,-150.195225,-2.055225,-15.625225,1418.374775,290.374775,588.374775,226.374775,-33.700225,65464.374775
...,...,...,...,...,...,...,...,...,...
16995,-159.885225,4.954775,16.374775,2181.374775,358.374775,871.374775,333.374775,-33.268125,111364.374775
16996,-159.895225,5.064775,0.374775,2313.374775,492.374775,1158.374775,429.374775,-33.107325,78964.374775
16997,-159.925225,6.214775,-18.625225,2641.374775,495.374775,1208.374775,420.374775,-32.593925,103564.374775
16998,-159.925225,6.174775,-16.625225,2636.374775,516.374775,1262.374775,442.374775,-33.645525,85764.374775
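
In the output above the latitude mean has been subtracted from every column. If the goal is to remean only `latitude` while keeping the other columns unchanged, one option (a sketch, not the only way) is `DataFrame.assign()`, which returns a copy with the named column replaced:

```python
import pandas as pd

df = pd.DataFrame({'longitude': [-114.31, -114.47, -114.56],
                   'latitude': [34.19, 34.40, 33.69],
                   'total_rooms': [5612.0, 7650.0, 720.0]})
latitude_mean = df.latitude.mean()

# Only the latitude column changes; df itself is left untouched.
remeaned = df.assign(latitude=df.latitude - latitude_mean)

print(remeaned)
```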


# Grouping and Sorting


Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column. However, often we want to group our data, and then do something specific to the group the data is in. 

As you'll learn, we do this with the `groupby()` operation.  We'll also cover some additional topics, such as more complex ways to index your DataFrames, along with how to sort your data.

One function we've been using heavily thus far is the `value_counts()` function. We can replicate what `value_counts()` does by doing the following:

In [None]:
data.groupby('latitude').latitude.count()
#data.groupby('latitude')
#data.latitude.value_counts()

latitude
32.54     1
32.55     3
32.56     9
32.57    13
32.58    20
         ..
41.82     1
41.84     1
41.86     3
41.88     1
41.95     2
Name: latitude, Length: 840, dtype: int64

`groupby()` created groups of records that share the same `latitude` value. Then, for each of these groups, we grabbed the `latitude` column and counted how many times it appeared. `value_counts()` is essentially a shortcut for this `groupby()` operation; the counts are the same, but `value_counts()` orders the result by frequency while `groupby()` orders it by group key.
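
The correspondence between the two can be checked on a handful of values. This sketch uses made-up latitudes; after sorting the `value_counts()` result by its index, the counts line up with the `groupby()` result.

```python
import pandas as pd

df = pd.DataFrame({'latitude': [32.55, 32.54, 32.55, 32.56, 32.55, 32.56]})

by_group = df.groupby('latitude').latitude.count()   # ordered by group key
by_value_counts = df.latitude.value_counts()         # ordered by frequency

print(by_group)
print(by_value_counts.sort_index())
```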

We can use any of the summary functions we've used before with this data. For example, to get the minimum value in each latitude group (trivially the latitude itself here, since we group on the same column), we can do the following:

In [None]:
data.groupby(['latitude']).latitude.min()
#data.groupby('latitude').count()
#data.latitude.value_counts()
#data.groupby('latitude').latitude.count()

latitude
32.54    32.54
32.55    32.55
32.56    32.56
32.57    32.57
32.58    32.58
         ...  
41.82    41.82
41.84    41.84
41.86    41.86
41.88    41.88
41.95    41.95
Name: latitude, Length: 840, dtype: float64

You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the `apply()` method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the latitude of the first record in each group:

In [None]:
data.groupby('latitude').apply(lambda df: df.latitude.iloc[0])

latitude
32.54    32.54
32.55    32.55
32.56    32.56
32.57    32.57
32.58    32.58
         ...  
41.82    41.82
41.84    41.84
41.86    41.86
41.88    41.88
41.95    41.95
Length: 840, dtype: float64

Another `groupby()` method worth mentioning is `agg()`, which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

In [None]:
data.groupby(['latitude']).latitude.agg([len, min, max])

Unnamed: 0_level_0,len,min,max
latitude,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
32.54,1,32.54,32.54
32.55,3,32.55,32.55
32.56,9,32.56,32.56
32.57,13,32.57,32.57
32.58,20,32.58,32.58
...,...,...,...
41.82,1,41.82,41.82
41.84,1,41.84,41.84
41.86,3,41.86,41.86
41.88,1,41.88,41.88
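
Grouping by a column and then summarising that same column gives a rather trivial table. `agg()` becomes more informative when we group by one column and summarise another. The sketch below uses a small hand-made frame (values chosen for illustration) and passes the aggregations by name as strings, which is an alternative to passing the built-in functions `len`, `min` and `max`.

```python
import pandas as pd

df = pd.DataFrame({
    'housing_median_age': [15.0, 19.0, 15.0, 19.0, 20.0],
    'median_house_value': [66900.0, 80100.0, 85700.0, 73400.0, 65500.0],
})

# One row per age group, one column per aggregation.
summary = (df.groupby('housing_median_age')
             .median_house_value
             .agg(['count', 'min', 'max']))

print(summary)
```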


## Sorting

 
To get data in the order we want it in, we can sort it ourselves. The `sort_values()` method is handy for this.

In [None]:
latitude_data = data.reset_index()
latitude_data

Unnamed: 0,index,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...,...
16995,16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


In [None]:
latitude_data.sort_values(by="latitude")

Unnamed: 0,index,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
680,680,-117.04,32.54,7.0,938.0,297.0,1187.0,282.0,1.2667,67500.0
796,796,-117.06,32.55,5.0,3223.0,940.0,3284.0,854.0,1.4384,108800.0
995,995,-117.09,32.55,8.0,6533.0,1217.0,4797.0,1177.0,3.9583,144400.0
679,679,-117.04,32.55,15.0,2206.0,648.0,2511.0,648.0,1.6348,93200.0
1161,1161,-117.12,32.56,20.0,2524.0,682.0,1819.0,560.0,2.9286,257700.0
...,...,...,...,...,...,...,...,...,...,...
15516,15516,-122.33,41.86,19.0,3599.0,695.0,1572.0,601.0,2.2340,58600.0
16833,16833,-123.26,41.86,25.0,2344.0,532.0,1117.0,424.0,2.7222,64600.0
16883,16883,-123.83,41.88,18.0,1504.0,357.0,660.0,258.0,3.1300,116700.0
16497,16497,-122.64,41.95,18.0,1867.0,424.0,802.0,314.0,1.8242,53500.0


`sort_values()` defaults to an ascending sort, where the lowest values go first. Often, however, we want a descending sort, where the highest values go first. To get one, pass `ascending=False`:

In [None]:
latitude_data.sort_values(by='latitude', ascending=False)


To sort by index values, use the companion method `sort_index()`. It accepts the same `ascending` argument and also sorts in ascending order by default:

In [None]:
data.sort_index()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.40,19.0,7650.0,1901.0,1129.0,463.0,1.8200,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.9250,65500.0
...,...,...,...,...,...,...,...,...,...
16995,-124.26,40.58,52.0,2217.0,394.0,907.0,369.0,2.3571,111400.0
16996,-124.27,40.69,36.0,2349.0,528.0,1194.0,465.0,2.5179,79000.0
16997,-124.30,41.84,17.0,2677.0,531.0,1244.0,456.0,3.0313,103600.0
16998,-124.30,41.80,19.0,2672.0,552.0,1298.0,478.0,1.9797,85800.0


Finally, you can sort by more than one column at a time by passing a list of column names to `by`.
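
A minimal sketch of a multi-column sort, using a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the housing data). Rows are ordered by the first column, and ties are broken by the second; `ascending` can be a list giving one direction per column.

```python
import pandas as pd

# Hypothetical toy data, not the housing dataset.
toy = pd.DataFrame({
    "latitude": [34.0, 32.5, 34.0, 32.5],
    "longitude": [-118.0, -117.0, -119.0, -116.5],
})

# Sort by latitude ascending, then by longitude descending within each latitude.
by_both = toy.sort_values(by=["latitude", "longitude"], ascending=[True, False])
print(by_both)
```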

# Missing Values



### Detecting null values
Pandas data structures have two useful methods for detecting null data: ``isnull()`` and ``notnull()``.
Either one will return a Boolean mask over the data. For example:

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [None]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [None]:
data.isna().sum()

2

In [None]:
data[data.notnull()]

0        1
2    hello
dtype: object

The ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s.
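
As a brief sketch of the ``DataFrame`` case (the `frame` object and its values are made up for illustration), ``isnull()`` returns a Boolean ``DataFrame`` of the same shape, and summing it counts the missing values in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with a few missing entries.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

mask = frame.isnull()          # Boolean DataFrame, True where a value is missing
nulls_per_column = mask.sum()  # number of missing values in each column

print(mask)
print(nulls_per_column)
```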

### Dropping null values

In addition to the masking used before, there are the convenience methods, ``dropna()``
(which removes NA values) and ``fillna()`` (which fills in NA values). For a ``Series``,
the result is straightforward:

In [None]:
data.dropna()

0        1
2    hello
dtype: object

For a ``DataFrame``, there are more options.
Consider the following ``DataFrame``:

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``.

By default, ``dropna()`` will drop all rows in which *any* null value is present:

In [None]:
df.dropna()

Alternatively, you can drop NA values along a different axis; ``axis=1`` (equivalently, ``axis='columns'``) drops all columns containing a null value:

In [None]:
df.dropna(axis='columns')

But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values.
This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through.

The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.
You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='columns', how='all')

For finer-grained control, the ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [None]:
df.dropna(axis='rows', thresh=3)

Here the first and last rows have been dropped, because they contain only two non-null values.
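
The three ``dropna()`` variants above can be compared side by side. The following is a self-contained sketch that rebuilds the same ``df`` (including the all-null column ``3``) and records what each call keeps; the result variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame, including the all-null column 3.
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

rows_any = df.dropna()                           # every row contains a null, so none survive
cols_all = df.dropna(axis="columns", how="all")  # only the all-null column 3 is dropped
rows_thresh = df.dropna(axis="rows", thresh=3)   # keeps rows with at least 3 non-null values

print(len(rows_any), list(cols_all.columns), list(rows_thresh.index))
```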

### Filling null values

Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced.

Consider the following ``Series``:

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

We can fill NA entries with a single value, such as zero:

In [None]:
data.fillna(0)

We can specify a forward-fill to propagate the previous value forward:

In [None]:
# forward-fill: propagate the last valid value forward
data.ffill()

Or we can specify a back-fill to propagate the next values backward:

In [None]:
# back-fill: propagate the next valid value backward
data.bfill()
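
For reference, here is a self-contained sketch of the three fills on the ``Series`` above. It uses the ``ffill()`` and ``bfill()`` methods for the forward and backward fills; the result variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero_filled = data.fillna(0)   # replace every null with 0
forward_filled = data.ffill()  # each null takes the previous valid value
backward_filled = data.bfill() # each null takes the next valid value

print(zero_filled.tolist())
print(forward_filled.tolist())
print(backward_filled.tolist())
```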

For ``DataFrame``s, the options are similar, but we can also specify an ``axis`` along which the fills take place:

In [None]:
df

In [None]:
# forward-fill along each row (left to right)
df.ffill(axis=1)
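
A short, self-contained sketch of a row-wise forward fill on the same ``df`` (the name `filled` is chosen here for illustration). Note that a null with no previous value along the chosen axis has nothing to copy from, so it stays null:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame, including the all-null column 3.
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Forward-fill along each row: a null takes the value to its left.
filled = df.ffill(axis=1)
print(filled)
```

In the last row the first entry remains null, because there is no value to its left to propagate.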