# **What is Data Science?**

Data science is a multidisciplinary field that combines various techniques, tools, and methodologies to extract insights and knowledge from structured and unstructured data. It involves the application of scientific methods, statistical analysis, machine learning algorithms, and data visualization to understand complex patterns, make predictions, and derive valuable information from data.

Data science encompasses a wide range of tasks, including data collection, data cleaning and preprocessing, exploratory data analysis, feature engineering, model building, model evaluation, and interpretation of results. It involves working with large volumes of data, often referred to as big data, and leveraging computational power and advanced analytics techniques to uncover hidden patterns and gain actionable insights.

The main goal of data science is to use data to solve real-world problems, make informed decisions, and drive business growth. Data scientists employ a combination of programming skills, mathematical and statistical knowledge, domain expertise, and critical thinking to identify relevant questions, analyze data, build predictive models, and communicate their findings effectively to stakeholders.

Data science has applications in various industries and domains, such as finance, healthcare, marketing, social media, transportation, and many others. It plays a crucial role in enabling organizations to leverage the power of data to drive innovation, optimize processes, improve customer experiences, and gain a competitive advantage.

Overall, data science involves the integration of mathematics, statistics, computer science, and domain knowledge to extract meaningful insights and make data-driven decisions.

# **What is Machine Learning?**

Machine learning is a subfield of artificial intelligence (AI) that focuses on the development of algorithms and models that enable computer systems to learn and improve from data without explicit programming. It is concerned with creating systems that can automatically analyze and interpret data, identify patterns, and make predictions or decisions.

In traditional programming, developers write explicit instructions to solve a particular problem. In machine learning, instead of being explicitly programmed, a model is trained on a large amount of data to learn patterns and relationships within that data. This training process allows the model to generalize and make predictions or take actions on new, unseen data.
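The contrast can be made concrete with a toy sketch. The heights, labels, and function names below are hypothetical and chosen only for illustration; the "learning" step is a deliberately simple threshold search, not a real machine learning algorithm from any library.

```python
# Traditional programming: the developer hard-codes the decision rule.
def rule_based_is_tall(height_cm):
    return height_cm >= 180  # threshold chosen by hand


# Machine learning (toy version): the threshold is learned from labeled examples.
def learn_threshold(heights, labels):
    """Return the candidate threshold that misclassifies the fewest examples."""
    best_threshold, best_errors = None, len(heights) + 1
    for candidate in sorted(set(heights)):
        errors = sum((h >= candidate) != label for h, label in zip(heights, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# Hypothetical training data: heights in cm, label True means "tall".
train_heights = [150, 160, 165, 172, 178, 183, 190]
train_labels = [False, False, False, True, True, True, True]

threshold = learn_threshold(train_heights, train_labels)


def learned_is_tall(height_cm):
    # The rule now depends on the data, not on a value typed in by the developer.
    return height_cm >= threshold


print(threshold)
print(learned_is_tall(175), learned_is_tall(158))
```

If the training data changes, the learned threshold changes with it, while the hand-written rule stays fixed until someone edits the code.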

The process of machine learning typically involves the following steps:

1. Data Collection: Gathering relevant data that represents the problem domain.

2. Data Preprocessing: Cleaning, transforming, and preparing the data for analysis by handling missing values, outliers, and formatting issues.

3. Feature Extraction/Selection: Identifying and selecting the most relevant features or variables that are likely to contribute to the model's performance.

4. Model Selection: Choosing an appropriate machine learning algorithm or model that suits the problem at hand.

5. Training: Using the training data to teach the model to recognize patterns and make predictions. This involves optimizing the model's parameters to minimize errors.

6. Evaluation: Assessing the model's performance on separate test data to measure its accuracy and generalization capabilities. Fine-tuning and comparing different models is usually done on a separate validation set or with cross-validation, so that the test data remains an unbiased final check.

7. Deployment: Integrating the trained model into an application or system to make predictions or automate decision-making.
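The steps above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions: the built-in Iris dataset stands in for collected data, logistic regression is an arbitrary model choice, the feature selection step is skipped because the dataset has only four features, and `new_sample` is a hypothetical measurement.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1, data collection: a small built-in dataset stands in for real data.
X, y = load_iris(return_X_y=True)

# Hold out test data before any fitting so evaluation uses unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 2, preprocessing: scale features using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Steps 4 and 5, model selection and training.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Step 6, evaluation on the held-out test split.
test_accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"test accuracy: {test_accuracy:.2f}")

# Step 7, deployment (sketch): the fitted scaler and model are reused on new input.
new_sample = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical measurement
predicted_class = model.predict(scaler.transform(new_sample))[0]
print(predicted_class)
```

Fitting the scaler on the training split only is a deliberate choice: it keeps information from the test data out of the training process.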

Machine learning encompasses various techniques, including supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training models on labeled data, where the desired output is known, to make predictions on new, unseen data. Unsupervised learning involves finding patterns and structures in unlabeled data, often for tasks like clustering and dimensionality reduction. Reinforcement learning involves training agents to make decisions in an environment to maximize rewards or minimize penalties.

Machine learning has numerous applications, such as image and speech recognition, natural language processing, recommendation systems, fraud detection, autonomous vehicles, and medical diagnosis, among many others. It has revolutionized many industries by enabling automated decision-making, enhancing efficiency, and providing valuable insights from vast amounts of data.

# **How is Machine Learning Useful in Data Science?**

Machine learning plays a crucial role in data science as it provides the tools and techniques to analyze and extract insights from data. In the field of data science, machine learning algorithms are applied to large datasets to uncover patterns, make predictions, and derive meaningful information. Here are some key ways machine learning is used in data science:

1. Predictive Modeling: Machine learning algorithms are used to build predictive models that can make accurate predictions or classifications based on historical data. These models can be trained to forecast future trends, identify potential risks, or classify new data points into specific categories.

2. Classification and Regression: Machine learning algorithms are utilized for classification tasks, where the goal is to assign data points to predefined classes or categories. They can also perform regression analysis to estimate and predict continuous numerical values based on input features.

3. Clustering and Segmentation: Unsupervised learning algorithms, such as clustering algorithms, are employed to group similar data points together based on their characteristics. This helps in identifying patterns, segmenting customer groups, and discovering hidden structures in data.

4. Anomaly Detection: Machine learning techniques are used to detect outliers or anomalies in data that deviate significantly from the expected patterns. Anomaly detection algorithms can be applied in various domains like fraud detection, network security, and quality control.

5. Natural Language Processing (NLP): Machine learning models are used in NLP tasks to process and understand human language. They enable sentiment analysis, text classification, named entity recognition, language translation, and question-answering systems.

6. Recommender Systems: Machine learning algorithms power recommendation engines that suggest products, services, or content to users based on their preferences and historical behavior. These systems use collaborative filtering, content-based filtering, or hybrid approaches to provide personalized recommendations.

7. Feature Engineering: Machine learning techniques assist in selecting and engineering relevant features from raw data. This involves transforming and extracting meaningful representations from the data to improve the performance of predictive models.

8. Deep Learning: Deep learning, a subset of machine learning, involves training deep neural networks with multiple layers to perform complex tasks like image recognition, speech synthesis, and natural language processing. Deep learning models have achieved state-of-the-art performance in various domains.

9. Model Evaluation and Optimization: Machine learning helps in assessing the performance of models by evaluating metrics like accuracy, precision, recall, and F1-score. Techniques like cross-validation and hyperparameter tuning are used to optimize and fine-tune models for better results.

Overall, machine learning techniques provide the foundation for data scientists to extract insights, build predictive models, and make data-driven decisions in various fields. They enable data scientists to unlock the value of data and derive actionable insights to solve complex problems.
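As a worked example of the evaluation metrics named in the list above, the sketch below computes accuracy, precision, recall, and F1-score by hand for a binary classifier. The label lists are hypothetical and exist only to make the arithmetic visible.

```python
# Hypothetical true labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives

accuracy = (tp + tn) / len(pairs)      # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted positives that are correct
recall = tp / (tp + fn)                # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tp, fp, fn, tn)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

Here precision (4/6) is lower than recall (4/5) because the model produces two false positives but only one false negative, which is a difference that accuracy alone would hide.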

# **What is Analytics?**

Analytics is the systematic process of analyzing data to uncover meaningful patterns and insights. It involves using statistical techniques and data exploration methods to gain a deeper understanding of historical data, make predictions about future events, and provide recommendations for optimized decision-making. Analytics helps organizations derive valuable insights from data to drive business growth, improve operational efficiency, and gain a competitive advantage.

# **How is Python Used in Machine Learning and Data Science?**

Python is a widely used programming language in the fields of machine learning and data science due to its simplicity, versatility, and extensive ecosystem of libraries and frameworks. Here are some key ways Python is used in machine learning and data science:

1. Data Manipulation and Preprocessing: Python provides libraries like NumPy and pandas, which offer powerful data structures and functions for data manipulation, cleaning, and preprocessing. These libraries allow data scientists to efficiently handle large datasets, perform data transformations, handle missing values, and prepare data for analysis.

2. Machine Learning Libraries: Python has robust libraries like scikit-learn, TensorFlow, and Keras that provide a wide range of machine learning algorithms and tools. These libraries offer pre-implemented algorithms for classification, regression, clustering, dimensionality reduction, and more. Python's simplicity and the availability of these libraries make it easier for data scientists to implement and experiment with various machine learning models.

3. Data Visualization: Python offers popular data visualization libraries such as Matplotlib, Seaborn, and Plotly, which enable the creation of visually appealing and informative graphs, charts, and plots. These libraries allow data scientists to visualize patterns, relationships, and trends within data, helping in data exploration, model evaluation, and communication of findings.

4. Deep Learning: Python is extensively used in deep learning frameworks like TensorFlow and PyTorch, which provide tools for building and training neural networks. These frameworks offer flexibility and efficiency for creating complex deep learning models, such as convolutional neural networks (CNNs) for image recognition or recurrent neural networks (RNNs) for natural language processing.

5. Jupyter Notebooks: Python integrates seamlessly with Jupyter Notebooks, which are interactive environments for data science and machine learning. Jupyter Notebooks allow data scientists to combine code, visualizations, and explanatory text in a single document, facilitating the development, documentation, and sharing of data science projects and analyses.

6. Integration with Other Tools: Python can be easily integrated with other data science tools and technologies. For example, it can connect to databases through libraries like SQLAlchemy, interface with big data frameworks like Apache Spark, and interact with cloud-based services like AWS or Google Cloud.

7. Community and Ecosystem: Python has a vibrant and active community of data scientists, machine learning practitioners, and researchers. This community contributes to a vast ecosystem of open-source libraries, resources, and tutorials, making it easier to access and learn from a wealth of shared knowledge and code.

# **What is Python?**

Python is a widely used, interpreted, object-oriented, high-level, general-purpose programming language with dynamic semantics. Its design philosophy emphasizes code readability through the use of significant indentation. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
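A short sketch can make the paradigms and the dynamic typing mentioned above concrete. The same sum of squares is computed in three styles; the class name `SquareSummer` and the variable names are hypothetical and used only for illustration.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural style: an explicit loop that updates a running total.
total_procedural = 0
for n in numbers:
    total_procedural += n * n


# Object-oriented style: state and behavior bundled in a class.
class SquareSummer:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value * value
        return self


summer = SquareSummer()
for n in numbers:
    summer.add(n)

# Functional style: composing map and reduce without mutable state.
total_functional = reduce(lambda acc, sq: acc + sq, map(lambda n: n * n, numbers), 0)

# Dynamic typing: the same name can be rebound to values of different types.
value = 42
value = "forty-two"

print(total_procedural, summer.total, total_functional, type(value).__name__)
```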

Python was created by Guido van Rossum, who began working on it in the late 1980s as a successor to the ABC programming language and first released it on February 20, 1991, as Python 0.9.0. While you may know the python as a large snake, the name of the Python programming language comes from an old BBC television comedy sketch series called Monty Python’s Flying Circus. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2. Python consistently ranks as one of the most popular programming languages.

One of the amazing features of Python is the fact that it began as one person’s work. Usually, new programming languages are developed and published by large companies employing lots of professionals, and due to copyright rules, it is very hard to name any of the people involved in the project. Python is an exception.

Of course, Guido van Rossum did not develop and evolve all the Python components himself. The speed with which Python has spread around the world is a result of the continuous work of thousands of (very often anonymous) programmers, testers, users (many of them aren’t IT specialists) and enthusiasts, but it must be said that the very first idea (the seed from which Python sprouted) came to one head: Guido’s.

Python is maintained by the Python Software Foundation, a non-profit membership organization and a community devoted to developing, improving, expanding, and popularizing the Python language and its environment.

Python is omnipresent, and people use numerous Python-powered devices on a daily basis, whether they realize it or not. There are billions of lines of code written in Python, which means almost unlimited opportunities for code reuse and learning from well-crafted examples. What’s more, there is a large and very active Python community, always happy to help.

There are also a couple of factors that make Python great for learning:

1. **It is easy to learn** – The time needed to learn Python is shorter than for many other languages; this means that it’s possible to start the actual programming faster.

2. **It is easy to use for writing new software** – It’s often possible to write code faster when using Python.

3. **It’s open source and easy to obtain, install and deploy** – Python is free, open and multiplatform; not all languages can boast that.

4. **It’s cross-platform** – Some languages require you to modify code to run on different platforms, but Python is a cross-platform language, which means you can run the same code on any operating system with a Python interpreter.

5. **It’s extendable** – Python can be extended with modules written in other languages (such as C or C++), so users can add low-level modules to the Python interpreter to customize and optimize their tools.

6. **It has a large standard library** – This library is available for anyone to access and means that users don’t have to write code for every single function; they can use built-in modules that help with everyday programming tasks and more.
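To illustrate the standard library point, here is a minimal sketch that uses only modules shipped with Python (`collections`, `statistics`, `datetime`, `json`), with nothing to install. The survey answers and scores are hypothetical data made up for the example.

```python
import json
import statistics
from collections import Counter
from datetime import date

# Hypothetical survey answers and scores used only for illustration.
answers = ["python", "java", "python", "c++", "python", "java"]
scores = [72, 85, 90, 66, 85]

# Counter tallies the answers; most_common(1) returns the top (item, count) pair.
most_common_language, votes = Counter(answers).most_common(1)[0]

summary = {
    "most_common_language": most_common_language,
    "votes": votes,
    "mean_score": statistics.mean(scores),
    "median_score": statistics.median(scores),
    # Date arithmetic: subtracting two dates gives a timedelta.
    "days_in_2024": (date(2025, 1, 1) - date(2024, 1, 1)).days,
}

# json turns the dictionary into a text format other programs can read.
print(json.dumps(summary, indent=2))
```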

Python is a programming language that opens many doors. With a solid knowledge of Python, you can work in a multitude of jobs and a multitude of industries. And even if you don’t need it for work, you will still find it useful to know to speed certain things up or develop a deeper understanding of other concepts. The more you understand Python, the more you can do in the 21st century.

Python is a great choice for career paths related to software development, engineering, DevOps, machine learning, data analytics, web development, and testing. What's more, there are also many jobs outside the IT industry that use Python. Since our lives are becoming more computerized every day, and the computer and technology areas previously associated only with technically gifted people are now opening up to non-programmers, Python has become one of the must-have tools in the toolbox of educators, managers, data scientists, data analysts, economists, psychologists, artists, and even secretaries.

Python is a general-purpose, open-source, high-level programming language that also provides a large number of libraries and frameworks. It has gained popularity because of its simplicity, easy syntax, and user-friendly environment.

# **Python is used in the following areas:**

1. Desktop Applications
2. Web Applications
3. Data Science
4. Artificial Intelligence
5. Machine Learning
6. Scientific Computing
7. Robotics
8. Internet of Things (IoT)
9. Gaming
10. Mobile Apps
11. Data Analysis and Pre-processing

# **Organizations using Python:**

1. Google (components of the Google spider and search engine)
2. Yahoo (Maps)
3. YouTube
4. Mozilla
5. Dropbox
6. Microsoft
7. Cisco
8. Spotify


# **What is the Pandas Library in Python?**

Pandas is a popular open-source data manipulation and analysis library for Python. It provides data structures and functions that make it easier to work with structured data, such as tabular data and time series data.

Some key features of Pandas include:

1. DataFrame: The primary data structure in Pandas is the DataFrame, which is a two-dimensional table-like data structure. It allows you to store and manipulate data in a row-column format, similar to a spreadsheet or a SQL table.

2. Data manipulation: Pandas provides a wide range of functions for data manipulation, including filtering, sorting, merging, joining, grouping, and reshaping data. These functions allow you to perform various data transformations and calculations efficiently.

3. Data cleaning and preprocessing: Pandas offers functions for handling missing data, dealing with duplicates, and performing data cleaning operations. It also provides tools for data preprocessing tasks, such as data normalization, data encoding, and feature engineering.

4. Data analysis and exploration: Pandas supports various statistical and exploratory data analysis operations. You can calculate descriptive statistics, perform aggregations, apply mathematical operations, and create visualizations using Pandas functions and integration with other libraries, such as Matplotlib and Seaborn.

5. Data input and output: Pandas supports reading and writing data in various formats, including CSV, Excel, SQL databases, and more. It provides functions to import data into a DataFrame from external sources and export DataFrame data to different file formats.

Pandas is widely used in data analysis, data science, and machine learning projects due to its flexibility, efficiency, and rich set of features. It simplifies many common data manipulation tasks and allows you to work with data effectively in Python.
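The features above can be sketched on a tiny DataFrame. The passenger records below are hypothetical, and the choice of filling missing ages with the median is only one possible cleaning strategy, picked here for illustration.

```python
import pandas as pd

# Hypothetical passenger records; the last row duplicates the third.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid", "Dee", "Cid"],
        "pclass": [1, 3, 3, 1, 3],
        "age": [29.0, None, 40.0, 35.0, 40.0],
        "fare": [211.3, 7.9, 8.1, 151.6, 8.1],
    }
)

# Data cleaning: drop exact duplicate rows, then fill missing ages with the median.
clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data manipulation: filter rows with a boolean mask and sort them.
first_class = clean[clean["pclass"] == 1].sort_values("fare")

# Data analysis: group by passenger class and aggregate.
mean_fare_by_class = clean.groupby("pclass")["fare"].mean()

print(clean)
print(first_class)
print(mean_fare_by_class)
```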

# **What is the NumPy Library in Python?**

NumPy (Numerical Python) is a powerful open-source library for numerical computing in Python. It provides high-performance multidimensional array objects (ndarrays) and a collection of functions for operating on these arrays efficiently.

Here are some key features of NumPy:

1. ndarray: NumPy's ndarray is a homogeneous, n-dimensional array object that can store elements of the same data type. It provides efficient storage and manipulation of large arrays of numerical data. The ndarray allows for vectorized operations, which enable performing calculations on entire arrays rather than individual elements, resulting in faster and more concise code.

2. Mathematical functions: NumPy provides a wide range of mathematical functions that operate efficiently on ndarrays. These functions include mathematical operations like addition, subtraction, multiplication, division, exponentiation, trigonometric functions, logarithmic functions, and more.

3. Broadcasting: NumPy's broadcasting feature allows element-wise operations between arrays of different but compatible shapes. Shapes are compared dimension by dimension starting from the trailing end, and two dimensions are compatible when they are equal or one of them is 1; the smaller array is then treated as if it were stretched to match, without copying data.

4. Linear algebra: NumPy includes functions for linear algebra operations such as matrix multiplication, matrix decomposition (e.g., QR, SVD, Cholesky), solving linear equations, and eigenvalue calculations. LU decomposition is provided by SciPy rather than NumPy. These functions provide efficient implementations for common linear algebra tasks.

5. Random number generation: NumPy offers functions to generate random numbers from various distributions. These functions are useful in simulations, modeling, and statistical analysis.

6. Integration with other libraries: NumPy serves as the foundation for many other libraries in the scientific Python ecosystem. It integrates well with libraries like SciPy, pandas, Matplotlib, scikit-learn, and more, providing a powerful environment for scientific computing and data analysis.

NumPy is widely used in various domains, including data analysis, scientific research, machine learning, and numerical simulations. Its efficient array operations and mathematical functions make it a fundamental library for numerical computing in Python.
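A brief sketch of several of these features follows. All array values are made up for illustration, and the seed passed to the random generator is an arbitrary choice that only serves reproducibility.

```python
import numpy as np

# Vectorized arithmetic: the operation applies to every element at once.
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.1

# Broadcasting: a (3, 1) column and a (4,) row combine into a (3, 4) grid.
column = np.array([[1], [2], [3]])
row = np.array([10, 20, 30, 40])
grid = column * row

# Linear algebra: solve A @ x = b and compute the singular values of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
singular_values = np.linalg.svd(A, compute_uv=False)

# Random numbers: a seeded generator makes the draw reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(with_tax)
print(grid.shape)
print(x)
```

The solve step avoids forming the inverse of `A` explicitly, which is the usual recommendation for solving linear systems numerically.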

# **What is the Matplotlib Library in Python?**

Matplotlib is a popular open-source library in Python for creating visualizations and plots. It provides a wide range of functions and tools for generating high-quality static, animated, and interactive visualizations.

Here are some key features of Matplotlib:

1. Plotting functions: Matplotlib offers a comprehensive set of functions for creating various types of plots, including line plots, scatter plots, bar plots, histograms, pie charts, box plots, and more. These functions allow you to customize the appearance of plots, such as colors, markers, line styles, labels, and titles.

2. Object-oriented API: Matplotlib provides an object-oriented API that allows fine-grained control over the appearance and layout of plots. This API allows you to create multiple plots within a figure, customize axes, add legends, annotations, and more.

3. Integration with NumPy: Matplotlib seamlessly integrates with NumPy, allowing you to plot NumPy arrays directly. This makes it easy to visualize data stored in NumPy arrays or perform calculations and transformations on the data before plotting.

4. Publication-quality output: Matplotlib produces high-quality visualizations suitable for publication and presentations. You can save plots in various file formats, including PNG, PDF, SVG, and more. Matplotlib provides options for customizing the resolution, size, and other parameters to meet your specific requirements.

5. Support for different styles and backends: Matplotlib allows you to customize the appearance of plots by using different styles and themes. It also supports different graphical backends, enabling you to display plots in different environments, including interactive interfaces, Jupyter notebooks, and web applications.

6. 3D plotting: Matplotlib includes functionality for creating 3D plots and visualizations. It supports surface plots, scatter plots, contour plots, and other types of 3D visualizations.

Matplotlib is widely used in various fields, such as data analysis, scientific research, engineering, finance, and more. Its versatility, ease of use, and extensive customization options make it a powerful tool for data visualization in Python.
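The object-oriented API and file output described above can be sketched as follows. The `Agg` backend is selected as an assumption so the script runs without a display, and the output file name `example_plot.png` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Object-oriented API: one figure holding two axes side by side.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with customized color, line style, labels, title, and legend.
ax_line.plot(x, np.sin(x), color="tab:blue", linestyle="--", label="sin(x)")
ax_line.set_title("Line plot")
ax_line.set_xlabel("x")
ax_line.set_ylabel("sin(x)")
ax_line.legend()

# Histogram of randomly generated values stored in a NumPy array.
rng = np.random.default_rng(seed=0)
ax_hist.hist(rng.normal(size=500), bins=20, color="tab:orange")
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("example_plot.png", dpi=150)  # hypothetical output file name
```

In a Jupyter notebook the backend line can be dropped, since the notebook displays the figure inline.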

# **How are These Three Libraries Helpful in Data Science and Machine Learning?**

The three libraries - Pandas, NumPy, and Matplotlib - play crucial roles in data science and machine learning workflows. Here's how each of them is helpful:

1. Pandas: Pandas is a powerful library for data manipulation and analysis. It provides data structures and functions that make it easy to clean, transform, and explore data. Some key features of Pandas include:
   - Data manipulation: Pandas offers data structures like DataFrames and Series that allow you to handle and manipulate structured data effectively.
   - Data cleaning: Pandas provides functions for handling missing values, removing duplicates, and transforming data.
   - Data exploration: Pandas supports various operations for data exploration, such as filtering, grouping, sorting, and aggregation.
   - Data integration: Pandas facilitates merging, joining, and reshaping datasets, enabling you to combine and transform data from multiple sources.

2. NumPy: NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and a collection of mathematical functions. Some key features of NumPy include:
   - N-dimensional arrays: NumPy's ndarray object allows efficient storage and manipulation of large arrays of homogeneous data.
   - Mathematical operations: NumPy provides a wide range of mathematical functions for performing operations on arrays, such as element-wise calculations, linear algebra, Fourier transforms, and random number generation.
   - Integration with other libraries: NumPy serves as the foundation for many other libraries in the scientific Python ecosystem, enabling seamless integration with tools like Pandas, Matplotlib, and scikit-learn.

3. Matplotlib: Matplotlib is essential for data visualization and creating plots. It allows you to generate a wide range of static, animated, and interactive visualizations to explore and communicate data. In data science and machine learning, Matplotlib helps in:
   - Visualizing data distributions: Matplotlib allows you to create histograms, box plots, and density plots to understand the distribution and statistics of your data.
   - Creating line and scatter plots: Matplotlib enables you to plot time series data, relationships between variables, and patterns in data using line plots and scatter plots.
   - Building custom visualizations: Matplotlib provides a highly customizable API, allowing you to create complex visualizations tailored to your specific needs.
   - Presenting results: Matplotlib helps you present your findings, insights, and model outputs through clear and informative plots, enhancing the interpretability of your work.

These libraries, together with other tools in the Python ecosystem like scikit-learn, provide a comprehensive environment for data analysis, modeling, and machine learning tasks. They offer efficient data handling, numerical computation, and visualization capabilities that are essential for data scientists and machine learning practitioners.

# **What is the Scikit-learn Library in Python?**

Scikit-learn, sometimes loosely called the Scikit library and imported in code as sklearn, is a powerful machine learning library for Python that provides a wide range of tools for data analysis and predictive modeling. It is built on top of NumPy, SciPy, and matplotlib, and is one of the most widely used libraries in the field of machine learning.

Here are some key features and functionalities of scikit-learn:

1. Machine Learning Algorithms: scikit-learn offers a comprehensive collection of machine learning algorithms, including supervised and unsupervised learning methods. It includes algorithms for classification, regression, clustering, dimensionality reduction, and more. These algorithms are implemented in a consistent and user-friendly API, making it easy to experiment and compare different models.

2. Data Preprocessing and Feature Engineering: scikit-learn provides a variety of preprocessing techniques to prepare your data before feeding it into machine learning algorithms. It offers tools for data cleaning, feature scaling, feature extraction, and feature selection. These preprocessing steps are essential for improving the quality and suitability of your data for modeling.

3. Model Evaluation and Selection: scikit-learn offers a range of evaluation metrics and techniques to assess the performance of machine learning models. It provides functions for computing accuracy, precision, recall, F1-score, and many other evaluation metrics. Additionally, it offers tools for cross-validation, hyperparameter tuning, and model selection, allowing you to choose the best model for your task.

4. Integration with NumPy and Pandas: scikit-learn seamlessly integrates with NumPy and Pandas, making it easy to handle and process data in their respective data structures. It can directly accept NumPy arrays and Pandas DataFrames as input for training and prediction tasks.

5. Pipeline and Feature Union: scikit-learn provides a powerful Pipeline and FeatureUnion mechanism that allows you to chain multiple preprocessing steps and machine learning algorithms into a single unit. This helps in creating robust and scalable machine learning workflows and facilitates the automation of repetitive tasks.

6. Extensibility and Community: scikit-learn is an open-source library with an active community of developers and users. It offers a rich ecosystem of extensions and contributions, including additional algorithms, utilities, and datasets. This makes it easy to leverage existing implementations and expand the functionality of scikit-learn.

Overall, scikit-learn simplifies the process of building and deploying machine learning models by providing a high-level and intuitive interface. It is widely used in various domains, including data science, research, industry, and academia, to solve a wide range of machine learning problems.
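The Pipeline, cross-validation, and hyperparameter tuning features listed above can be sketched together. The Iris dataset, the k-nearest-neighbors classifier, and the candidate values for `n_neighbors` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain preprocessing and the estimator so both are fitted inside each CV fold.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]
)

# 5-fold cross-validation of the whole pipeline.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean CV accuracy: {cv_scores.mean():.2f}")

# Hyperparameter tuning: step parameters are addressed as <step>__<parameter>.
search = GridSearchCV(pipeline, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refitted on the training part of every fold, so the validation part never influences the preprocessing.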

# **What is the Keras Library in Python?**

Keras is a high-level neural networks library written in Python. It is designed to be user-friendly, modular, and extensible, making it an excellent choice for implementing deep learning models. Keras provides a simple and intuitive interface to build, train, and evaluate neural networks, abstracting away the complexities of lower-level frameworks like TensorFlow or Theano.

Here are some key features of the Keras library:

1. User-Friendly API: Keras offers a user-friendly and intuitive API that allows you to quickly build and prototype neural networks. It provides a set of high-level building blocks called "layers" that can be easily combined to create complex neural network architectures. With Keras, you can define a neural network by simply stacking the layers together.

2. Modular and Extensible: Keras follows a modular design philosophy, allowing you to easily mix and match different layers, loss functions, and optimizers to create custom neural network architectures. It provides a wide range of pre-built layers and utilities, making it easy to experiment with different configurations and adapt models to various tasks.

3. Multi-Backend Support: Keras has supported multiple backend frameworks over its history. Earlier releases could run on TensorFlow, Theano, or Microsoft Cognitive Toolkit (CNTK); support for Theano and CNTK ended after the Keras 2.3 series, and Keras then shipped with TensorFlow 2.x as tf.keras. Keras 3 reintroduced multi-backend support, this time for TensorFlow, JAX, and PyTorch.

4. Deep Learning Building Blocks: Keras provides a rich set of building blocks for deep learning, including various types of layers (dense, convolutional, recurrent, etc.), activation functions, regularization techniques, and optimization algorithms. These components can be easily combined to create complex architectures for tasks such as image classification, natural language processing, and more.

5. Easy Model Training and Evaluation: Keras simplifies the training and evaluation process by providing high-level functions and utilities. It offers a range of loss functions, metrics, and callbacks that can be used during the training process. Keras also supports GPU acceleration, allowing you to leverage the computational power of GPUs to train models faster.

6. Pre-Trained Models and Transfer Learning: Keras provides access to a collection of pre-trained models that have been trained on large datasets. These models can be used as a starting point for your own tasks or as feature extractors in transfer learning scenarios. This enables you to benefit from state-of-the-art architectures and weights without the need for extensive training.

7. Integration with TensorFlow and Ecosystem: With its integration with TensorFlow, Keras benefits from TensorFlow's computational graph capabilities and ecosystem. This allows you to leverage the powerful TensorFlow features while enjoying the simplicity and ease of use of Keras.

Keras has gained widespread popularity in the deep learning community due to its simplicity, flexibility, and extensive documentation. It is widely used for building and training deep learning models, and it serves as a high-level interface for TensorFlow, enabling developers to quickly iterate and experiment with neural network architectures.
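As a minimal sketch of the layer-stacking idea described above, the following assumes TensorFlow 2.x is installed. The input size of 4 features and the layer widths are hypothetical choices for illustration and are not tied to any dataset in this notebook.

```python
# Assumes TensorFlow 2.x is installed; all sizes below are illustrative.
from tensorflow import keras

# Stack layers in order: input -> hidden dense layer -> single sigmoid output.
model = keras.Sequential([
    keras.Input(shape=(4,)),                      # 4 input features (hypothetical)
    keras.layers.Dense(8, activation="relu"),     # hidden layer with 8 units
    keras.layers.Dense(1, activation="sigmoid"),  # binary-classification output
])

# Pick an optimizer, a loss function, and a metric before training.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

n_params = model.count_params()  # total number of weights and biases
print(n_params)
```

Calling `model.summary()` prints each layer with its output shape and parameter count.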

In [2]:
import pandas as pd

Here we import the pandas library under the alias pd, so we can now use its functions through the pd prefix.

In [None]:
pd.read_csv('https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Titanic.csv')

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,


Here we read the data from a CSV file by passing its URL to pd.read_csv().
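Besides a URL, `read_csv` also accepts a local path or a file-like object. The sketch below uses a hypothetical three-row sample held in memory (not the real Titanic file) so that it runs without network access.

```python
import io
import pandas as pd

# Hypothetical three-row sample with a few Titanic-style columns.
csv_text = """pclass,survived,sex,age
1,1,female,29.0
1,1,male,0.92
3,0,female,
"""

# StringIO wraps the text so read_csv can treat it like an open file.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The empty field at the end of the last row is parsed as a missing value (NaN) in the `age` column.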

Note: the call above only displays the data. To keep working with it, we have to assign the result to a variable.
Let's elaborate on this:

In [3]:
titanic = pd.read_csv('https://raw.githubusercontent.com/YBI-Foundation/Dataset/main/Titanic.csv')

The data is now stored in a DataFrame (a variable) named titanic. Next we will use three functions: head(), info(), and describe().

In [None]:
titanic.head()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [None]:
titanic.describe()

,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881138,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.413493,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


head(): shows the first 5 rows by default, so we can check whether the data was imported properly.

info(): shows a summary of the DataFrame: its class, the number of entries (rows), the number of columns, the name, non-null count, and data type of each column, and the memory usage.

describe(): displays summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns.

In [4]:
titanic.tail()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


tail(): works like head(), but returns the last 5 rows by default, whereas head() returns the first 5 rows.
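Both functions take an optional row count. The following sketch uses a hypothetical ten-row frame (the column names are made up for illustration) to show the default and an explicit count.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
df = pd.DataFrame({"passenger": range(10), "fare": [7.25 + i for i in range(10)]})

default_head = df.head()   # no argument: first 5 rows
first_three = df.head(3)   # rows with index 0, 1, 2
last_two = df.tail(2)      # rows with index 8, 9

print(first_three)
print(last_two)
```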

In [5]:
titanic.describe(include='all')

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
count,1309.0,1309.0,1309,1309,1046.0,1309.0,1309.0,1309,1308.0,295,1307,486.0,121.0,745
unique,,,1307,2,,,,929,,186,3,27.0,,369
top,,,"Connolly, Miss. Kate",male,,,,CA. 2343,,C23 C25 C27,S,13.0,,"New York, NY"
freq,,,2,843,,,,11,,6,914,39.0,,64
mean,2.294882,0.381971,,,29.881138,0.498854,0.385027,,33.295479,,,,160.809917,
std,0.837836,0.486055,,,14.413493,1.041658,0.86556,,51.758668,,,,97.696922,
min,1.0,0.0,,,0.17,0.0,0.0,,0.0,,,,1.0,
25%,2.0,0.0,,,21.0,0.0,0.0,,7.8958,,,,72.0,
50%,3.0,0.0,,,28.0,0.0,0.0,,14.4542,,,,155.0,
75%,3.0,1.0,,,39.0,1.0,0.0,,31.275,,,,256.0,
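The difference between `describe()` and `describe(include='all')` is easier to see on a small frame. This is a sketch on hypothetical data: the default call summarizes only numeric columns, while `include="all"` adds the `unique`, `top`, and `freq` rows for text columns.

```python
import pandas as pd

# Hypothetical four-row frame with one numeric and one text column.
df = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, None],
    "sex": ["female", "male", "male", "male"],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # every column, NaN where a statistic does not apply

print(numeric_summary)
print(full_summary)
```

Both summaries end with a `max` row, and `count` excludes missing values, so `age` reports a count of 3 here.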


In [7]:
titanic.shape

(1309, 14)

shape returns a tuple containing the total number of rows (records) and the total number of columns in the DataFrame.

In [8]:
titanic.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

columns displays the column labels of the DataFrame.
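Because `shape` and `columns` are attributes rather than methods, they are used without parentheses. A short sketch on a hypothetical two-row frame shows how their values can be unpacked and reused:

```python
import pandas as pd

# Hypothetical frame with three Titanic-style columns.
df = pd.DataFrame({"pclass": [1, 3], "sex": ["female", "male"], "age": [29.0, 27.0]})

n_rows, n_cols = df.shape            # unpack the (rows, columns) tuple
column_names = df.columns.tolist()   # convert the Index to a plain Python list

print(n_rows, n_cols)
print(column_names)
```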

In [11]:
titanic['age']

0       29.00
1        0.92
2        2.00
3       30.00
4       25.00
        ...  
1304    14.50
1305      NaN
1306    26.50
1307    27.00
1308    29.00
Name: age, Length: 1309, dtype: float64

This selects the 'age' column and returns its values as a Series.

In [10]:
titanic['age'].shape

(1309,)

This displays the total number of records in the 'age' column. The shape has a single element and no column count, because selecting one column by name returns a one-dimensional Series.

In [13]:
titanic[['age']]

,age
0,29.00
1,0.92
2,2.00
3,30.00
4,25.00
...,...
1304,14.50
1305,
1306,26.50
1307,27.00


This selects the 'age' column as a DataFrame, because the column name is passed inside a list.

In [14]:
titanic[['age']].shape

(1309, 1)

This displays both the number of records and the number of columns. The column count appears because the double brackets select the column as a DataFrame, whereas the earlier single-bracket selection returned a Series.
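The Series versus DataFrame distinction can be checked directly. The sketch below uses a hypothetical three-row frame and compares the types and shapes produced by single and double brackets.

```python
import pandas as pd

# Hypothetical three-row frame.
df = pd.DataFrame({"age": [29.0, 0.92, None], "sex": ["female", "male", "female"]})

age_series = df["age"]        # single brackets: one-dimensional Series
age_frame = df[["age"]]       # list of labels: two-dimensional DataFrame
subset = df[["age", "sex"]]   # the list form also selects several columns

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```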

In [17]:
titanic['sex'].unique()

array(['female', 'male'], dtype=object)

unique() returns the distinct values (categories) that appear in a column.

In [18]:
titanic['sex'].nunique()

2

nunique() returns the number of distinct values (categories) in a column.

In [19]:
titanic['sex'].value_counts()

male      843
female    466
Name: sex, dtype: int64

value_counts() displays the number of observations in each category of a column. For example, the 'sex' column has two categories, 'male' and 'female', so it shows the total number of males and the total number of females.
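These three functions treat missing values differently, which matters for columns that have gaps. The sketch below uses a hypothetical five-value Series with one missing entry. The `normalize` and `dropna` parameters of `value_counts` are standard pandas options that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical column with one missing entry.
sex = pd.Series(["male", "female", "male", "male", None], name="sex")

categories = sex.unique()                       # distinct values, missing value included
n_categories = sex.nunique()                    # missing values are excluded by default
counts = sex.value_counts()                     # sorted by count, missing values dropped
shares = sex.value_counts(normalize=True)       # proportions instead of counts
with_missing = sex.value_counts(dropna=False)   # keep the missing value as its own row

print(counts)
print(shares)
```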

In [20]:
titanic.isna().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

isna().sum() displays the count of missing values in each column.

In [21]:
titanic.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

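isnull().sum() displays the count of null values in each column. isnull() is an alias of isna(), so the output is identical to the previous one.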
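A small sketch on hypothetical data confirms that the two calls agree. It also shows one possible follow-up step, the fraction of missing values per column, which is an addition for illustration and not part of the notebook.

```python
import pandas as pd

# Hypothetical four-row frame with gaps in two columns.
df = pd.DataFrame({
    "age": [29.0, None, 26.5, None],
    "cabin": ["B5", None, None, None],
    "sex": ["female", "male", "male", "male"],
})

missing_counts = df.isna().sum()                          # missing values per column
same_result = df.isnull().sum().equals(missing_counts)    # isnull gives the same counts
missing_share = df.isna().mean()                          # fraction missing per column

print(missing_counts)
print(missing_share)
```

Taking the mean of the boolean mask works because True counts as 1 and False as 0.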