# Data Analysis

- Data analysis is the process of inspecting, cleaning, transforming, and modeling data in order to find useful information, suggest conclusions, and support decision-making.
 
 

# These are the steps used for Analysis:
- Data Collection
- Data Processing
- Data Cleaning
- Data Analysis
- Communication

# Tools Used in Data Analysis
- Microsoft Excel
- Python
- R
- Jupyter Notebook
- Apache Spark
- SAS
- Microsoft Power BI
- Tableau
- KNIME


# Introduction To Python Visualization

- Python is known for its simplicity and flexibility.
- It has become a popular choice for data analysis and visualization.
- Its rich ecosystem of libraries and tools makes it possible to create a wide range of visualizations, from basic graphs to interactive dashboards.
- This makes it useful whether you're a data analyst, a scientist, a business professional, or just someone intrigued by the world of data.

# Data Wrangling
* Getting & Reading data from different sources.
* Cleaning Data
* Shaping & Structuring Data
* Storing Data

There are many tools and libraries available for data wrangling, for example tools like RapidMiner and libraries like pandas. Organizations often find libraries better suited because of their flexibility.
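
The four wrangling steps listed above can be sketched with pandas as follows. This is a minimal illustration, not a fixed recipe: the CSV text, the column names, and the choice to fill a missing city with "Unknown" are all made up for the example.

```python
import io
import pandas as pd

# Hypothetical raw CSV text standing in for a real file or database export.
raw_csv = io.StringIO(
    "name,city,sales\n"
    "Asha,Pune,120\n"
    "Ben,,95\n"
    "Asha,Pune,120\n"
)

# 1. Get and read the data.
df = pd.read_csv(raw_csv)

# 2. Clean: drop exact duplicate rows and fill the missing city.
df = df.drop_duplicates().fillna({"city": "Unknown"})

# 3. Shape: total sales per city.
per_city = df.groupby("city", as_index=False)["sales"].sum()

# 4. Store: write the result back out as CSV text.
out = io.StringIO()
per_city.to_csv(out, index=False)
print(out.getvalue())
```

In practice the `io.StringIO` objects would be replaced by file paths or database connections.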

# Pandas
* High-performance, easy-to-use open-source library for data analysis
* Creates tabular data from different sources such as CSV, JSON, and databases
* Has utilities for descriptive statistics, aggregation, and handling missing data
* Database-style utilities such as merge and join are available
* Fast, programmable, and easy alternative to spreadsheets
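
A short sketch of the features listed above, using two made-up tables (`orders` and `customers` are hypothetical names and values, chosen only for illustration). It fills a missing value, prints descriptive statistics, and performs a database-style merge followed by an aggregation.

```python
import pandas as pd

# Made-up tables for illustration only.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, None, 80.0, 120.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
})

# Handling missing data: fill the missing amount with the column mean.
orders["amount"] = orders["amount"].fillna(orders["amount"].mean())

# Descriptive statistics for the numeric column.
print(orders["amount"].describe())

# Database-style join followed by an aggregation.
merged = orders.merge(customers, on="customer_id", how="left")
totals = merged.groupby("name")["amount"].sum()
print(totals)
```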

# We'll explore various libraries such as

**Matplotlib** :

Matplotlib is one of the most popular and foundational libraries for creating static, interactive, and animated visualizations.
It provides a wide range of plot types, customization options, and control over every aspect of the visualization.

**Seaborn** :

Built on top of Matplotlib, Seaborn specializes in creating aesthetically pleasing statistical visualizations.
It simplifies the process of generating complex visualizations like heatmaps, pair plots, and violin plots.

**Plotly** :

Plotly is known for interactive and web-based visualizations.
It offers both Python and JavaScript APIs, allowing you to create interactive plots that respond to user actions, making it suitable for dashboards and web applications.

**Pandas Plotting** :

The Pandas library includes built-in plotting functions that allow you to create basic visualizations directly from DataFrame objects.
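
As a small illustration of how pandas plotting and Matplotlib work together, the sketch below draws a bar chart straight from a DataFrame and then adjusts it with Matplotlib calls. The monthly figures are invented, and the `Agg` backend line is only there so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly figures for illustration only.
sales = pd.DataFrame(
    {"month": ["Jan", "Feb", "Mar", "Apr"], "units": [120, 135, 90, 160]}
)

# DataFrame.plot draws on a Matplotlib Axes and returns it,
# so Matplotlib calls can customize the result afterwards.
ax = sales.plot(x="month", y="units", kind="bar", legend=False)
ax.set_ylabel("Units sold")
ax.set_title("Monthly sales (sample data)")

fig = ax.get_figure()
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` or `plt.show()` would be the usual next step.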

# Types of Visualizations Python Can Create :
- Line Plots
- Bar Charts
- Pie Charts
- Scatter Plots 
- Histograms
- Heatmaps
- Box Plots
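
Three of the plot types listed above can be drawn side by side with Matplotlib, as in this sketch. The values are synthetic random numbers generated with a fixed seed, not real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values, generated only for illustration.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=30, scale=10, size=200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Line plot of the sorted values.
axes[0].plot(np.sort(values))
axes[0].set_title("Line plot")

# Histogram showing how the values are distributed.
axes[1].hist(values, bins=20)
axes[1].set_title("Histogram")

# Box plot summarizing median, quartiles, and outliers.
axes[2].boxplot(values)
axes[2].set_title("Box plot")

fig.tight_layout()
```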

# Here are some real-world use cases for Python visualization  

**1. Financial Data Analysis:**
   - **Tool:** Matplotlib, Seaborn, Plotly
   - **Use Case:** Visualizing stock price trends, portfolio performance, and financial market data. Plot candlestick charts, line graphs, and scatter plots to analyze financial data.

**2. Healthcare Data Visualization:**
   - **Tool:** Matplotlib, Seaborn, Plotly
   - **Use Case:** Creating medical dashboards, visualizing patient data, and tracking disease outbreaks. Visualize patient demographics, disease prevalence, and treatment outcomes.

**3. E-commerce Analytics:**
   - **Tool:** Plotly, Matplotlib, Pandas
   - **Use Case:** Analyzing customer behavior, sales trends, and product performance. Create interactive dashboards to monitor key performance indicators (KPIs) like conversion rates and customer retention.

**4. Climate Data Visualization:**
   - **Tool:** Cartopy, Matplotlib, Plotly
   - **Use Case:** Visualizing climate data such as temperature patterns, rainfall, and sea-level rise. Create maps, heatmaps, and time series plots to understand climate change.

**5. Social Media Analytics:**
   - **Tool:** Plotly, Seaborn, NetworkX
   - **Use Case:** Analyzing social media engagement, sentiment analysis, and network connections. Visualize network graphs, word clouds, and sentiment trends.

**6. Scientific Research:**
   - **Tool:** Matplotlib, Plotly, Mayavi
   - **Use Case:** Visualizing experimental results, simulations, and scientific data. Create 3D plots, contour plots, and heatmaps to represent scientific findings.

**7. Marketing Campaign Analysis:**
   - **Tool:** Plotly, Matplotlib, Seaborn
   - **Use Case:** Evaluating the effectiveness of marketing campaigns, A/B testing, and customer segmentation. Visualize conversion funnels, click-through rates, and customer segments.

**8. Geospatial Data Visualization:**
   - **Tool:** Folium, Plotly, Matplotlib
   - **Use Case:** Mapping and geospatial analysis, including visualizing geographic data like population density, urban planning, and geographical features.

**9. Retail Inventory Management:** 
   - **Tool:** Matplotlib, Seaborn, Pandas
   - **Use Case:** Visualizing inventory levels, demand forecasting, and sales trends. Create bar charts, time series plots, and inventory heatmaps.

**10. Machine Learning Model Evaluation:**
   - **Tool:** Scikit-learn, Yellowbrick
   - **Use Case:** Evaluating machine learning model performance through visualizations like ROC curves, confusion matrices, and feature importances.

**11. Network Analysis:**
   - **Tool:** NetworkX, Plotly
   - **Use Case:** Analyzing social networks, transportation networks, or cybersecurity. Visualize network structures, identify central nodes, and detect anomalies.

# Data cleaning using Python in Titanic Data

### Import the Necessary libraries for data cleaning

In [7]:

## Data Analysis libraries

import pandas as pd
import numpy as np

## Data Visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
data = pd.read_csv(r"C:\Users\Rahul\titanic_1.csv")
data.head()

Unnamed: 0,PassengerID,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### Issues with data

1. Handle missing values
2. Rename the columns
3. Drop the duplicated columns
4. Check for duplicate rows
5. Age should be an integer
6. Round the Fare column to 2 decimal places
7. Renumber the Passenger ID column properly
8. Capitalize the column names
9. Replace the values in the columns properly
10. Check whether the Sex and Who columns are repeated
11. Verify that the data types of the columns are proper
12. Alone and Sibsp
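
Several of the issues above (missing values, duplicate rows, data types) can be surveyed in a few lines before any cleaning starts. The sketch below uses a small stand-in frame retyped from the first five rows of the `head()` output; on the real data the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Stand-in frame: a few columns from the first five rows shown above.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
})

missing = sample.isna().sum()              # missing values per column
n_dupes = int(sample.duplicated().sum())   # rows that repeat an earlier row

print(missing)
print("duplicate rows:", n_dupes)
print(sample.dtypes)                       # data type of each column
```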

# Dropping the duplicate and unnecessary columns

In [11]:
data.columns

Index(['PassengerID', 'survived', 'pclass', 'sex', 'age', 'sibsp', 'parch',
       'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

1. class and pclass are duplicate columns
2. survived and alive are duplicate columns
3. embarked and embark_town are duplicate columns
4. The solution is to delete the duplicate columns from the data
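
Before dropping a column as a duplicate, it can help to confirm that the two columns really carry the same information. One possible check, sketched below, is that every value in one column pairs with exactly one value in the other. The helper `is_one_to_one` is a hypothetical name introduced here, and the frame is retyped from the first five rows shown earlier.

```python
import pandas as pd

# Stand-in frame built from the first five rows shown above.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "alive": ["no", "yes", "yes", "yes", "no"],
    "pclass": [3, 1, 3, 1, 3],
    "class": ["Third", "First", "Third", "First", "Third"],
})


def is_one_to_one(df, a, b):
    """Return True if each value of column a pairs with exactly one
    value of column b, and vice versa (hypothetical helper)."""
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)


print(is_one_to_one(sample, "pclass", "class"))     # same information
print(is_one_to_one(sample, "survived", "alive"))   # same information
print(is_one_to_one(sample, "pclass", "survived"))  # different information
```

A result of `True` on the full data would support dropping one column of the pair; `False` means the columns differ somewhere and should be inspected first.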

In [12]:
cols = ['pclass', 'alive', 'embarked']
cols

['pclass', 'alive', 'embarked']

In [13]:
data = data.drop(cols, axis = 1)
data.head()

Unnamed: 0,PassengerID,survived,sex,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alone
0,0,0,male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,1,female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,False
2,2,1,female,26.0,0,0,7.925,Third,woman,False,,Southampton,True
3,3,1,female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,0,male,35.0,0,0,8.05,Third,man,True,,Southampton,True


## Delete the duplicate rows of the data

In [14]:
### data with original rows and columns
data.shape

(891, 13)

In [15]:
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [16]:
data[data.duplicated()]

Unnamed: 0,PassengerID,survived,sex,age,sibsp,parch,fare,class,who,adult_male,deck,embark_town,alone


- The above output is an empty DataFrame with only column names and no values present in the rows.
- This indicates that we do not have any duplicated rows present in the data
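
When the data does contain repeated rows, the same `duplicated()` mask can be counted with `sum()` and the repeats removed with `drop_duplicates()`. A minimal sketch on a made-up frame with one repeated row:

```python
import pandas as pd

# Made-up frame with one repeated row, for illustration only.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Asha"],
    "fare": [7.25, 8.05, 7.25],
})

# True counts as 1, so summing the mask counts the duplicate rows.
n_dupes = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```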

### Capitalize the columns

In [17]:
data.columns = data.columns.str.capitalize()
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Sibsp,Parch,Fare,Class,Who,Adult_male,Deck,Embark_town,Alone
0,0,0,male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,1,female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,False
2,2,1,female,26.0,0,0,7.925,Third,woman,False,,Southampton,True
3,3,1,female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,0,male,35.0,0,0,8.05,Third,man,True,,Southampton,True


In [18]:
s = 'hello python'
s.capitalize()

'Hello python'

In [19]:
s.title()

'Hello Python'

In [20]:
data.columns

Index(['Passengerid', 'Survived', 'Sex', 'Age', 'Sibsp', 'Parch', 'Fare',
       'Class', 'Who', 'Adult_male', 'Deck', 'Embark_town', 'Alone'],
      dtype='object')
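
The difference between `capitalize()` and `title()` also applies to column names through the `.str` accessor. The sketch below compares the two on three of the original column names; replacing underscores with spaces is an optional extra step, not something the notebook itself does.

```python
import pandas as pd

cols = pd.Index(["PassengerID", "adult_male", "embark_town"])

# capitalize(): only the first character is upper-cased, the rest lower-cased.
print(cols.str.capitalize().tolist())
# ['Passengerid', 'Adult_male', 'Embark_town']

# title(): the first letter of every word is upper-cased;
# an underscore counts as a word boundary.
print(cols.str.title().tolist())
# ['Passengerid', 'Adult_Male', 'Embark_Town']

# Optional: swap underscores for spaces before title-casing.
pretty = cols.str.replace("_", " ").str.title()
print(pretty.tolist())
# ['Passengerid', 'Adult Male', 'Embark Town']
```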

### Rename the columns of the data

In [21]:
dic = {
    'Sibsp' : 'Siblings/Spouse',
    'Parch' : 'Parents/Children',
    'Class' : 'Passenger Class'
}
dic

{'Sibsp': 'Siblings/Spouse',
 'Parch': 'Parents/Children',
 'Class': 'Passenger Class'}

In [22]:
data = data.rename(dic, axis = 1)
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Deck,Embark_town,Alone
0,0,0,male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,1,female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,False
2,2,1,female,26.0,0,0,7.925,Third,woman,False,,Southampton,True
3,3,1,female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,0,male,35.0,0,0,8.05,Third,man,True,,Southampton,True


### Replace male - Male, female - Female in Sex column
### Replace 0 - No and 1 - Yes in the Survived column

In [23]:
dic1 = {
    'male' : 'Male',
    'female' : 'Female'
}
dic1

{'male': 'Male', 'female': 'Female'}

In [24]:
dic2 = {
    0 : 'No',
    1 : 'Yes'
}
dic2

{0: 'No', 1: 'Yes'}

In [25]:
data['Sex'] = data['Sex'].replace(dic1)
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Deck,Embark_town,Alone
0,0,0,Male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,1,Female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,False
2,2,1,Female,26.0,0,0,7.925,Third,woman,False,,Southampton,True
3,3,1,Female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,0,Male,35.0,0,0,8.05,Third,man,True,,Southampton,True


In [26]:
data['Survived'] = data['Survived'].replace(dic2)
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Deck,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,Yes,Female,38.0,1,0,71.2833,First,woman,False,C,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.925,Third,woman,False,,Southampton,True
3,3,Yes,Female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,,Southampton,True
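
`replace()` is used above; `map()` is a closely related method that behaves differently for values missing from the dictionary. The sketch below shows the difference on a made-up Series that contains an extra value, `'unknown'`, which does not occur in the Titanic data.

```python
import pandas as pd

# Made-up Series; 'unknown' is included only to show the difference.
sex = pd.Series(["male", "female", "unknown"])
labels = {"male": "Male", "female": "Female"}

# replace(): values not found in the dictionary are left unchanged.
replaced = sex.replace(labels)

# map(): values not found in the dictionary become NaN.
mapped = sex.map(labels)

print(replaced.tolist())
print(mapped.tolist())
```

`replace()` is the safer choice when some values should stay as they are; `map()` makes unexpected values stand out as missing.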


### Round the value of fare column to 2 decimal places

In [27]:
score = 72.45672389
score

72.45672389

In [28]:
round(score, 3)

72.457

In [29]:
data['Fare'] = round(data['Fare'], 2)
data.head(15)

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Deck,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,,Southampton,False
1,1,Yes,Female,38.0,1,0,71.28,First,woman,False,C,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.92,Third,woman,False,,Southampton,True
3,3,Yes,Female,35.0,1,0,53.1,First,woman,False,C,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,,Southampton,True
5,5,No,Male,,0,0,8.46,Third,man,True,,Queenstown,True
6,6,No,Male,54.0,0,0,51.86,First,man,True,E,Southampton,True
7,7,No,Male,2.0,3,1,21.08,Third,child,False,,Southampton,False
8,8,Yes,Female,27.0,0,2,11.13,Third,woman,False,,Southampton,False
9,9,Yes,Female,14.0,1,0,30.07,Second,child,False,,Cherbourg,False
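
The built-in `round()` works on a Series because pandas supports it, and the equivalent method form is `Series.round()`. The sketch below checks that both give the same result on three fares taken from the output above. Note that halfway cases may not round up (7.925 became 7.92 above), which is consistent with binary floating-point representation and round-half-to-even behaviour.

```python
import pandas as pd

# Three fares taken from the table above.
fares = pd.Series([71.2833, 53.1, 8.05])

rounded = fares.round(2)             # method form
same = round(fares, 2).equals(rounded)  # built-in round() gives the same Series

print(rounded.tolist())
print(same)
```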


### Investigate the deck column

In [30]:
data['Deck'].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: Deck, dtype: int64

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Passengerid       891 non-null    int64  
 1   Survived          891 non-null    object 
 2   Sex               891 non-null    object 
 3   Age               714 non-null    float64
 4   Siblings/Spouse   891 non-null    int64  
 5   Parents/Children  891 non-null    int64  
 6   Fare              891 non-null    float64
 7   Passenger Class   891 non-null    object 
 8   Who               891 non-null    object 
 9   Adult_male        891 non-null    bool   
 10  Deck              203 non-null    object 
 11  Embark_town       889 non-null    object 
 12  Alone             891 non-null    bool   
dtypes: bool(2), float64(2), int64(3), object(6)
memory usage: 78.4+ KB


In [32]:
688 / 891 * 100

77.21661054994388

- Only 203 of the 891 Deck values are non-null, so 891 - 203 = 688 values (about 77.22 %) are missing. Hence, delete the Deck column from the data
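
The same percentage can be computed for every column at once instead of by hand. In the sketch below the frame is made up, and the 70 % cut-off is a hypothetical threshold chosen for the example, not a rule from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Deck": [np.nan, "C", np.nan, np.nan],
})

# isna() gives True/False per cell; the mean of booleans is the share of True.
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Hypothetical cut-off: drop columns with more than 70 % missing values.
threshold = 70
to_drop = missing_pct[missing_pct > threshold].index.tolist()
cleaned = df.drop(columns=to_drop)
print(cleaned.columns.tolist())
```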

In [33]:
data = data.drop('Deck', axis = 1)
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,Southampton,False
1,1,Yes,Female,38.0,1,0,71.28,First,woman,False,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.92,Third,woman,False,Southampton,True
3,3,Yes,Female,35.0,1,0,53.1,First,woman,False,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,Southampton,True


### Should Age be an integer column?

In [34]:
## Checking the unique values present in the age column to decide whether age should be int or float

data['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

- The Age column contains values like 0.67, 0.42, and 0.92, indicating that some passengers are babies under one year old.
- If Age is converted from float to int, values like 0.67 and 0.42 will all become 0 (the decimal part is lost), so Age is kept as float.
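
Both problems with an integer Age column can be demonstrated on a tiny made-up Series: truncation of infant ages, and the fact that a plain `int` column cannot hold missing values. The last line shows pandas' nullable `Int64` dtype as one possible alternative; it is an aside, not a step the notebook takes.

```python
import numpy as np
import pandas as pd

# Made-up ages: one adult, two infants, one missing value.
age = pd.Series([22.0, 0.42, 0.83, np.nan])

# Truncation: after dropping NaN, infants under one year all become 0.
truncated = age.dropna().astype(int)
print(truncated.tolist())

# A plain int column cannot store NaN, so the direct cast fails.
try:
    age.astype(int)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("Cast failed:", exc)

# Possible alternative: round first, then use the nullable Int64 dtype,
# which keeps the missing value as <NA>.
nullable = age.round().astype("Int64")
print(nullable.tolist())
```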

In [35]:
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,Southampton,False
1,1,Yes,Female,38.0,1,0,71.28,First,woman,False,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.92,Third,woman,False,Southampton,True
3,3,Yes,Female,35.0,1,0,53.1,First,woman,False,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,Southampton,True


### Check whether Sex and Who columns are duplicate

In [36]:
data['Sex'].unique()

array(['Male', 'Female'], dtype=object)

In [37]:
data['Who'].unique()

array(['man', 'woman', 'child'], dtype=object)

- The unique values of the Sex and Who columns are different, indicating that they are not duplicate columns. Hence, we should not drop either of them
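
A cross-tabulation makes the relationship between the two columns explicit: it counts how often each Sex value occurs with each Who value. The sketch below uses the Sex and Who values of the first ten rows shown earlier as a stand-in for the full data.

```python
import pandas as pd

# Sex and Who values from the first ten rows shown earlier.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female"],
    "Who": ["man", "woman", "woman", "woman", "man",
            "man", "man", "child", "woman", "child"],
})

# Rows are Sex values, columns are Who values, cells are counts.
table = pd.crosstab(sample["Sex"], sample["Who"])
print(table)
```

Because `child` appears under both `Male` and `Female`, Who cannot be derived from Sex alone.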

## Verify the data types of the columns are proper

In [38]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Passengerid       891 non-null    int64  
 1   Survived          891 non-null    object 
 2   Sex               891 non-null    object 
 3   Age               714 non-null    float64
 4   Siblings/Spouse   891 non-null    int64  
 5   Parents/Children  891 non-null    int64  
 6   Fare              891 non-null    float64
 7   Passenger Class   891 non-null    object 
 8   Who               891 non-null    object 
 9   Adult_male        891 non-null    bool   
 10  Embark_town       889 non-null    object 
 11  Alone             891 non-null    bool   
dtypes: bool(2), float64(2), int64(3), object(5)
memory usage: 71.5+ KB


In [39]:
data

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,Southampton,False
1,1,Yes,Female,38.0,1,0,71.28,First,woman,False,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.92,Third,woman,False,Southampton,True
3,3,Yes,Female,35.0,1,0,53.10,First,woman,False,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,Southampton,True
...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,No,Male,27.0,0,0,13.00,Second,man,True,Southampton,True
887,887,Yes,Female,19.0,0,0,30.00,First,woman,False,Southampton,True
888,888,No,Female,,1,2,23.45,Third,woman,False,Southampton,False
889,889,Yes,Male,26.0,0,0,30.00,First,man,True,Cherbourg,True


### Filter the data for the passengers who are children

In [40]:
## Boolean Indexing: keep only the rows where Who is 'child'
data[data['Who'] == 'child']

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
7,7,No,Male,2.00,3,1,21.08,Third,child,False,Southampton,False
9,9,Yes,Female,14.00,1,0,30.07,Second,child,False,Cherbourg,False
10,10,Yes,Female,4.00,1,1,16.70,Third,child,False,Southampton,False
14,14,No,Female,14.00,0,0,7.85,Third,child,False,Southampton,True
16,16,No,Male,2.00,4,1,29.12,Third,child,False,Queenstown,False
...,...,...,...,...,...,...,...,...,...,...,...,...
831,831,Yes,Male,0.83,1,1,18.75,Second,child,False,Southampton,False
850,850,No,Male,4.00,4,2,31.28,Third,child,False,Southampton,False
852,852,No,Female,9.00,1,1,15.25,Third,child,False,Cherbourg,False
869,869,Yes,Male,4.00,1,1,11.13,Third,child,False,Southampton,False


- The Titanic data contains 83 children.

### How many male children were travelling on the Titanic?

In [41]:
data[(data['Sex'] == 'Male') & (data['Who'] == 'child')]

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
7,7,No,Male,2.0,3,1,21.08,Third,child,False,Southampton,False
16,16,No,Male,2.0,4,1,29.12,Third,child,False,Queenstown,False
50,50,No,Male,7.0,4,1,39.69,Third,child,False,Southampton,False
59,59,No,Male,11.0,5,2,46.9,Third,child,False,Southampton,False
63,63,No,Male,4.0,3,2,27.9,Third,child,False,Southampton,False
78,78,Yes,Male,0.83,0,2,29.0,Second,child,False,Southampton,False
125,125,Yes,Male,12.0,1,0,11.24,Third,child,False,Cherbourg,False
164,164,No,Male,1.0,4,1,39.69,Third,child,False,Southampton,False
165,165,Yes,Male,9.0,0,2,20.52,Third,child,False,Southampton,False
171,171,No,Male,4.0,4,1,29.12,Third,child,False,Queenstown,False


In [42]:
data[(data['Sex'] == 'Male') & (data['Who'] == 'child')].count()

Passengerid         40
Survived            40
Sex                 40
Age                 40
Siblings/Spouse     40
Parents/Children    40
Fare                40
Passenger Class     40
Who                 40
Adult_male          40
Embark_town         40
Alone               40
dtype: int64

- 40 male children were travelling on the Titanic
- Total children = 83, male children = 40, female children = 83 - 40 = 43
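
Instead of calling `count()` on the filtered DataFrame, which returns one number per column, the boolean mask itself can be summed to get a single count. The sketch below uses the first ten rows shown earlier as a stand-in, so its counts are for those ten rows only, not for the full data.

```python
import pandas as pd

# Sex and Who values from the first ten rows shown earlier.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Female", "Male",
            "Male", "Male", "Male", "Female", "Female"],
    "Who": ["man", "woman", "woman", "woman", "man",
            "man", "man", "child", "woman", "child"],
})

is_child = sample["Who"] == "child"
is_male = sample["Sex"] == "Male"

# True counts as 1, so summing a mask counts the matching rows.
n_children = int(is_child.sum())
n_male_children = int((is_child & is_male).sum())

# Counts for both sexes in one step, without subtracting by hand.
by_sex = sample.loc[is_child, "Sex"].value_counts()

print(n_children, n_male_children)
print(by_sex)
```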

In [43]:
data.head()

Unnamed: 0,Passengerid,Survived,Sex,Age,Siblings/Spouse,Parents/Children,Fare,Passenger Class,Who,Adult_male,Embark_town,Alone
0,0,No,Male,22.0,1,0,7.25,Third,man,True,Southampton,False
1,1,Yes,Female,38.0,1,0,71.28,First,woman,False,Cherbourg,False
2,2,Yes,Female,26.0,0,0,7.92,Third,woman,False,Southampton,True
3,3,Yes,Female,35.0,1,0,53.1,First,woman,False,Southampton,False
4,4,No,Male,35.0,0,0,8.05,Third,man,True,Southampton,True


## Column Indexing

In [44]:
data['Sex']

0        Male
1      Female
2      Female
3      Female
4        Male
        ...  
886      Male
887    Female
888    Female
889      Male
890      Male
Name: Sex, Length: 891, dtype: object

In [45]:
data[['Sex', 'Survived']]

Unnamed: 0,Sex,Survived
0,Male,No
1,Female,Yes
2,Female,Yes
3,Female,Yes
4,Male,No
...,...,...
886,Male,No
887,Female,Yes
888,Female,No
889,Male,Yes
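
Column indexing and boolean indexing can be combined in a single `.loc` call, which selects rows and columns at the same time. The sketch below uses values from the first five rows of the cleaned data as a stand-in.

```python
import pandas as pd

# Values from the first five rows of the cleaned data shown above.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Female", "Male"],
    "Survived": ["No", "Yes", "Yes", "Yes", "No"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

one_col = sample["Sex"]                  # single brackets return a Series
two_cols = sample[["Sex", "Survived"]]   # a list of names returns a DataFrame

# .loc[rows, columns]: survivors only, and just the Sex and Age columns.
survivors = sample.loc[sample["Survived"] == "Yes", ["Sex", "Age"]]
print(survivors)
```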


# Learning URLs

https://www.learndatasci.com/best-data-analytics-courses/ 

https://www.geeksforgeeks.org/pandas-tutorial/

https://www.w3schools.com/python/pandas/default.asp

https://www.geeksforgeeks.org/python-data-analysis-using-pandas/?ref=shm