# IPv4

<img src="res/ip01.png" width="500px">
<img src="res/ip02.png" width="500px">
<img src="res/ip03.png" width="500px">
<img src="res/ip04.png" width="500px">
<img src="res/ip05.png" width="500px">

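1. IP fragmentation and reassembly have inherent problems, such as out-of-order fragments, duplicate IDs, and lost fragments. The IP layer cannot solve these by itself; recovery is left to upper-layer protocols such as TCP.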

<img src='res/ip06.png' width=500px>


-------------------------
-------------------------

## linux2.6/net/ipv4/ip_fragment.c

```c
/* Describe an entry in the "incomplete datagrams" queue. */
struct ipq {
	struct ipq	*next;		/* linked list pointers			*/
	struct list_head lru_list;	/* lru list member 			*/
	u32		user;
	u32		saddr;
	u32		daddr;
	u16		id;
	u8		protocol;
	u8		last_in;
#define COMPLETE		4
#define FIRST_IN		2
#define LAST_IN			1

	struct sk_buff	*fragments;	/* linked list of received fragments	*/
	int		len;		/* total length of original datagram	*/
	int		meat;
	spinlock_t	lock;
	atomic_t	refcnt;
	struct timer_list timer;	/* when will this queue expire?		*/
	struct ipq	**pprev;
	int		iif;
	struct timeval	stamp;
};
```

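1. `struct ipq` is a node in a doubly linked list. The interesting part is that its "prev" link, `pprev`, is a pointer to a pointer: it stores the address of the previous node's `next` pointer. This is a **clever implementation**.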

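2. The head of each queue has to live in the hash table. If the hash table stored node structs directly, every insert, delete, or rehash would be expensive, so it stores node pointers instead.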

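3. If `pprev` were a plain pointer to the previous node, every operation such as deletion would have to check whether the node is the head of the queue and update the hash table slot separately. With a double pointer, the head and a middle node are handled in exactly the same way.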

4. This is in fact the common data structure the kernel uses for hash chains.
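The points above can be sketched in userspace C. This is a minimal illustration, not the kernel code: `struct node`, `node_add_head`, and `node_unlink` are hypothetical names, and only the `next`/`pprev` pair mirrors `struct ipq`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace node; mirrors the next/pprev pair in struct ipq. */
struct node {
	struct node *next;
	struct node **pprev;	/* address of whichever pointer points at us */
	int key;
};

/* Insert n at the front of the chain whose head pointer is *head. */
static void node_add_head(struct node **head, struct node *n)
{
	n->next = *head;
	if (*head)
		(*head)->pprev = &n->next;
	*head = n;
	n->pprev = head;
}

/* Unlink n. There is no "am I the head?" test: *pprev is either the
 * bucket slot or the previous node's next field, and both are updated
 * in the same way. */
static void node_unlink(struct node *n)
{
	assert(n->pprev != NULL);
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	n->next = NULL;
	n->pprev = NULL;
}
```

Removing the first node rewrites the bucket slot itself, and removing a middle node rewrites the predecessor's `next`; the code path is identical in both cases.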

### linux2.6/include/list.h

Hash list structure

```c
/*
 * Double linked lists with a single pointer list head.
 * Mostly useful for hash tables where the two pointer list head is
 * too wasteful.
 * You lose the ability to access the tail in O(1).
 */

struct hlist_head {
	struct hlist_node *first;
};

struct hlist_node {
	struct hlist_node *next, **pprev;
};

```

Ordinary list structure

```c
/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */

struct list_head {
	struct list_head *next, *prev;
};

```



5. For details, see [kernel hlist](res/ip07.png)

	[kernel hlist explained](https://blog.csdn.net/hs794502825/article/details/24597773)


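6. An `ipq` is a node of the hash list. The list exists to handle hash collisions, **not to hold the different fragments of one packet**. Different `ipq`s correspond to different IP packets. The fragments of each packet (that is, of each node) are stored in the node's own `fragments` list, which is a list of `sk_buff`.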

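7. Of course, the drawback of the double pointer is that the previous node can no longer be accessed directly. In most cases the fields of the previous node are not needed, since the main operations are inserting and deleting nodes, so this is enough. If access really is required, the classic Linux macro `container_of` can be used (this only works when the node is not first in its chain, because then `pprev` points at the hash table slot rather than into a node).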

* The `container_of` macro

```c
/**
 * container_of - cast a member of a structure out to the containing structure
 *
 * @ptr:	the pointer to the member.
 * @type:	the type of the container struct this is embedded in.
 * @member:	the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({			\
        const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
        (type *)( (char *)__mptr - offsetof(type,member) );})
```

* The `offsetof` macro

```c
#ifdef __compiler_offsetof
#define offsetof(TYPE,MEMBER) __compiler_offsetof(TYPE,MEMBER)
#else
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#endif
```
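As a usage illustration, here is a small userspace sketch of recovering a container from a pointer to one of its members. `container_of_sketch` is a simplified variant without the `typeof` type check, and `struct link`, `struct frag_queue`, and `queue_of` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>	/* standard offsetof */

/* Simplified, portable variant without the typeof type check. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct link {
	struct link *next;
};

/* Hypothetical container that embeds a link, much as ipq embeds lru_list. */
struct frag_queue {
	int id;
	struct link lru;
};

/* Step back from the embedded member to the structure that holds it. */
static struct frag_queue *queue_of(struct link *l)
{
	assert(l != NULL);
	return container_of_sketch(l, struct frag_queue, lru);
}
```

The subtraction works because the member sits at a fixed byte offset inside every instance of the container type.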


8. The states recorded in `last_in`

```c
#define COMPLETE		4
#define FIRST_IN		2
#define LAST_IN			1
```

The IP header has an MF (More Fragments) flag and a Fragment Offset field. MF=1 means this is a fragment and more follow. MF=0 means this is the last fragment. The first fragment has offset=0.

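So when a fragment with offset=0 arrives, `FIRST_IN` is set on the ipq. When a fragment with MF=0 arrives, `LAST_IN` is set. When the queue is finished with, whether through timeout, reassembly, or an error, `COMPLETE` is set.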
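A minimal sketch of how these bits could be derived from the header's 16-bit flags/offset word. It assumes the word has already been converted to host byte order; `note_fragment` and `frag_offset_bytes` are hypothetical helpers, while the mask values follow the IPv4 header layout (MF is bit 13, the low 13 bits are the offset in 8-byte units).

```c
#include <assert.h>
#include <stdint.h>

#define IP_MF		0x2000	/* "more fragments" flag */
#define IP_OFFSET	0x1FFF	/* fragment offset, in 8-byte units */

#define COMPLETE	4
#define FIRST_IN	2
#define LAST_IN		1

/* Byte offset of this fragment within the original datagram. */
static unsigned int frag_offset_bytes(uint16_t frag_off)
{
	return (unsigned int)(frag_off & IP_OFFSET) << 3;
}

/* Return the queue's last_in bits after seeing one more fragment.
 * Assumption of this sketch: a COMPLETE queue takes no more fragments. */
static uint8_t note_fragment(uint8_t last_in, uint16_t frag_off)
{
	assert((last_in & COMPLETE) == 0);
	if (frag_offset_bytes(frag_off) == 0)
		last_in |= FIRST_IN;
	if ((frag_off & IP_MF) == 0)
		last_in |= LAST_IN;
	return last_in;
}
```

A middle fragment sets neither bit, so the flags alone cannot tell whether every middle fragment has arrived.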

-----------------

### struct sk_buff *ip_defrag(struct sk_buff *skb, u32 user)

```c
/* Process an incoming IP datagram fragment. */
struct sk_buff *ip_defrag(struct sk_buff *skb, u32 user)
```

1. This function handles every incoming fragment: it finds the matching ipq and inserts the fragment into it. If

```c
		if (qp->last_in == (FIRST_IN|LAST_IN) &&
		    qp->meat == qp->len)
```

holds, it merges the fragments (the sk_buff list) into a single sk_buff, completing reassembly.
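The completion test needs both conditions: the flags only say that the first and last fragments were seen, while `meat == len` says there are no holes in between. A rough userspace sketch, assuming `meat` counts the payload bytes currently queued and `len` tracks the largest end offset seen (`struct mini_ipq`, `mini_queue`, and `mini_complete` are hypothetical, and overlap handling is left out):

```c
#include <assert.h>

#define FIRST_IN	2
#define LAST_IN		1

/* Hypothetical cut-down queue: only the fields the test reads. */
struct mini_ipq {
	unsigned char last_in;
	int len;	/* largest end offset seen so far */
	int meat;	/* payload bytes currently queued */
};

/* Account for one non-overlapping fragment covering [offset, offset+size). */
static void mini_queue(struct mini_ipq *q, int offset, int size, int more_frags)
{
	assert(offset >= 0 && size > 0);
	if (offset == 0)
		q->last_in |= FIRST_IN;
	if (!more_frags)
		q->last_in |= LAST_IN;
	if (offset + size > q->len)
		q->len = offset + size;
	q->meat += size;
}

/* Same shape as the condition checked in ip_defrag. */
static int mini_complete(const struct mini_ipq *q)
{
	return q->last_in == (FIRST_IN | LAST_IN) && q->meat == q->len;
}
```

For a 3000-byte payload arriving as last, first, middle, the queue has both flags set after the second fragment but only becomes complete once the middle 1480 bytes bring `meat` up to `len`.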

-----------------

### static void ip_frag_queue(struct ipq *qp, struct sk_buff *skb)

1. Puts the fragment into the queue. Fragments may be corrupted, out of order, or overlapping; the function handles all of these cases.

```c
/* Add new segment to existing queue. */
static void ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
{  
	/* ... */

	/* xitongsys:
	 * `end` is where this fragment's data ends within the original
	 * datagram, i.e. offset + the length of this fragment's data.
	 */
	/* Determine the position of this fragment. */
 	end = offset + skb->len - ihl;


	/* Is this the final fragment? */
	if ((flags & IP_MF) == 0) {
		/* If we already have some bits beyond end
		 * or have different end, the segment is corrupted.
		 */
		if (end < qp->len ||
		    ((qp->last_in & LAST_IN) && end != qp->len))
			goto err;
		qp->last_in |= LAST_IN;
		qp->len = end;

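		/* xitongsys:
		 * 1. This is the last fragment (MF = 0).
		 * 2. For a valid last fragment, end >= qp->len always holds.
		 *    If an MF=0 fragment was already seen (LAST_IN set), end
		 *    must equal qp->len; otherwise end is normally greater,
		 *    since qp->len only tracks the largest end seen so far.
		 */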


	} else {
		if (end > qp->len) {
			/* Some bits beyond end -> corruption. */
			if (qp->last_in & LAST_IN)
				goto err;
			qp->len = end;
		}

		/* xitongsys:
		 * 1. If end > qp->len but the MF=0 fragment has already
		 *    arrived, the total length is already fixed, so this
		 *    fragment is an error.
		 */
	}
	
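	/* xitongsys:
	 * 1. The rest of the logic inserts the fragment into the fragments
	 *    list. The main work is handling overlap (both prev and next
	 *    may overlap), because different fragments can carry
	 *    overlapping data. This is also a known network attack:
	 *    https://en.wikipedia.org/wiki/IP_fragmentation_attack
	 *    so it needs special handling here.
	 */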


	/* We found where to put this one.  Check for overlap with
	 * preceding fragment, and, if needed, align things so that
	 * any overlaps are eliminated.
	 */
	if (prev) {
		int i = (FRAG_CB(prev)->offset + prev->len) - offset;

		if (i > 0) {
			offset += i;
			if (end <= offset)
				goto err;
			if (!pskb_pull(skb, i))
				goto err;
			if (skb->ip_summed != CHECKSUM_UNNECESSARY)
				skb->ip_summed = CHECKSUM_NONE;
		}
	}

	while (next && FRAG_CB(next)->offset < end) {
		int i = end - FRAG_CB(next)->offset; /* overlap is 'i' bytes */

		if (i < next->len) {
			/* Eat head of the next overlapped fragment
			 * and leave the loop. The next ones cannot overlap.
			 */
			if (!pskb_pull(next, i))
				goto err;
			FRAG_CB(next)->offset += i;
			qp->meat -= i;
			if (next->ip_summed != CHECKSUM_UNNECESSARY)
				next->ip_summed = CHECKSUM_NONE;
			break;
		} else {
			struct sk_buff *free_it = next;

			/* Old fragment is completely overridden with
			 * new one drop it.
			 */
			next = next->next;

			if (prev)
				prev->next = next;
			else
				qp->fragments = next;

			qp->meat -= free_it->len;
			frag_kfree_skb(free_it, NULL);
		}
	}

	FRAG_CB(skb)->offset = offset;

	return;

err:
	kfree_skb(skb);
}
```
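The overlap handling is easier to follow on plain byte ranges. The sketch below models each fragment as `[offset, offset + len)` and reproduces the two trimming steps; `struct frag`, `trim_against_prev`, and `trim_next` are hypothetical names, and the sketch ignores checksums and the `meat` accounting.

```c
#include <assert.h>

/* Hypothetical fragment: payload covers [offset, offset + len). */
struct frag {
	int offset;
	int len;
};

/* Trim the head of `cur` so that it starts where `prev` ends.
 * Returns 0 on success, -1 if nothing of `cur` would remain
 * (the case where the kernel code jumps to err). */
static int trim_against_prev(const struct frag *prev, struct frag *cur)
{
	int end = cur->offset + cur->len;
	int i = (prev->offset + prev->len) - cur->offset;

	assert(prev->offset <= cur->offset);
	if (i > 0) {
		if (end <= cur->offset + i)
			return -1;
		cur->offset += i;	/* corresponds to offset += i */
		cur->len -= i;		/* corresponds to pskb_pull(skb, i) */
	}
	return 0;
}

/* Resolve overlap between `cur` and a later fragment `next`.
 * Returns 1 if `next` is fully covered and should be dropped,
 * 0 otherwise (its head is trimmed on partial overlap). */
static int trim_next(const struct frag *cur, struct frag *next)
{
	int end = cur->offset + cur->len;
	int i = end - next->offset;	/* overlap is 'i' bytes */

	if (i <= 0)
		return 0;
	if (i < next->len) {
		next->offset += i;
		next->len -= i;
		return 0;
	}
	return 1;
}
```

Note the asymmetry: against `prev` the new fragment gives up its own overlapping head, while against `next` the already queued fragment is trimmed or dropped.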

---------

### static struct sk_buff *ip_frag_reasm(struct ipq *qp, struct net_device *dev)

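1. Once all fragments have arrived, merge them into a single sk_buff.

2. sk_buff itself supports a frag list. So what happens here is essentially that the list held in the ipq is attached to the head's `frag_list`, which turns the whole chain into a single sk_buff.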

#### linux2.6/include/linux/skbuff.h
```c
/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb->end.
 */
struct skb_shared_info {
	atomic_t	dataref;
	unsigned int	nr_frags;
	unsigned short	tso_size;
	unsigned short	tso_segs;
	struct sk_buff	*frag_list;
	skb_frag_t	frags[MAX_SKB_FRAGS];
};
```
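A simplified sketch of the chaining idea: no payload is copied, the remaining buffers are just hung off the head. `struct buf` and `merge_chain` are hypothetical stand-ins; in the real structure `frag_list` lives in `skb_shared_info`, not in the buffer itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sk_buff: `next` chains the queue,
 * `frag_list` hangs follow-on buffers off a head buffer. */
struct buf {
	struct buf *next;
	struct buf *frag_list;
	int len;	/* bytes in this buffer; after a merge, the total */
};

/* Turn the chain starting at `head` into one logical buffer. */
static struct buf *merge_chain(struct buf *head)
{
	struct buf *fp;

	assert(head != NULL);
	head->frag_list = head->next;	/* rest of the chain moves here */
	head->next = NULL;
	for (fp = head->frag_list; fp; fp = fp->next)
		head->len += fp->len;	/* head now reports the total */
	return head;
}
```

The cost is proportional to the number of fragments, not to the number of payload bytes.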
--------------------
------------------

## linux2.6/net/ipv4/ip_input.c



### int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)

```c
/*
 * 	Main IP Receive routine.
 */ 
int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)
{
	struct iphdr *iph;
	/* ... */
		return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
		       ip_rcv_finish);

}
```

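1. After a few simple checks, this function passes the packet through the netfilter `NF_IP_PRE_ROUTING` hook with `ip_rcv_finish` as the callback to continue with. `ip_rcv_finish` is where the main processing happens.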

```c
static inline int ip_rcv_finish(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct iphdr *iph = skb->nh.iph;

	/*
	 *	Initialise the virtual path cache for the packet. It describes
	 *	how the packet travels inside Linux networking.
	 */ 
	if (skb->dst == NULL) {
		if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))
			goto drop; 
	}

	/* ... */

	return dst_input(skb);

	/* ... */
}
```

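1. `ip_route_input` consults the routing table and fills in the skb's dst entry, which determines where the packet goes next, for example up to the local upper-layer protocol.

2. `dst_input` then hands the packet on for further processing by calling `skb->dst->input`.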

```c

/* Input packet from network to transport.  */
static inline int dst_input(struct sk_buff *skb)
{
	int err;

	for (;;) {
		err = skb->dst->input(skb);

		if (likely(err == 0))
			return err;
		/* Oh, Jamal... Seems, I will not forgive you this mess. :-) */
		if (unlikely(err != NET_XMIT_BYPASS))
			return err;
	}
}

#define NET_XMIT_BYPASS		4	/* packet does not leave via dequeue; ... */

```
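The pattern here is a function pointer chosen once by the routing lookup and called later without re-deciding. A minimal userspace sketch with hypothetical names throughout (`struct dst`, `struct pkt`, `route_input`, `pkt_input`, `local_deliver`, `forward`); it leaves out the `NET_XMIT_BYPASS` retry loop.

```c
#include <assert.h>
#include <stddef.h>

struct pkt;

/* Hypothetical cut-down routing result: just the input handler. */
struct dst {
	int (*input)(struct pkt *p);
};

struct pkt {
	struct dst *dst;
	int delivered_locally;
	int forwarded;
};

static int local_deliver(struct pkt *p) { p->delivered_locally = 1; return 0; }
static int forward(struct pkt *p)       { p->forwarded = 1; return 0; }

static struct dst local_dst   = { local_deliver };
static struct dst forward_dst = { forward };

/* Stand-in for the routing lookup: picks the handler once. */
static void route_input(struct pkt *p, int addressed_to_us)
{
	p->dst = addressed_to_us ? &local_dst : &forward_dst;
}

/* Stand-in for dst_input(): call whatever the lookup attached. */
static int pkt_input(struct pkt *p)
{
	assert(p->dst != NULL && p->dst->input != NULL);
	return p->dst->input(p);
}
```

The receive path stays the same whether the packet is for the local host or needs forwarding; only the attached handler differs.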
-----------------